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Abstract —We propose a new segmentation model combining common 
regularization energies, e.g. Markov Random Field (MRF) potentials, 
and standard pairwise clustering criteria like Normalized Cut (NC), av¬ 
erage association (AA), etc. These clustering and regularization models 
are widely used in machine learning and computer vision, but they were 
not combined before due to significant differences in the corresponding 
optimization, e.g. spectral relaxation and combinatorial max-flow tech¬ 
niques. On the one hand, we show that many common applications 
using MRF segmentation energies can benefit from a high-order NC 
term, e.g. enforcing balanced clustering of arbitrary high-dimensional 
image features combining color, texture, location, depth, motion, etc. On 
the other hand, standard clustering applications can benefit from an in¬ 
clusion of common pairwise or higher-order MRF constraints, e.g. edge 
alignment, bin-consistency, label cost, etc. To address joint energies 
like NC-hMRF, we propose efficient Kernel Cut algorithms based on 
bound optimization. While focusing on graph cut and move-making tech¬ 
niques, our new unary (linear) kernel and spectral bound formulations 
for common pairwise clustering criteria allow to integrate them with any 
regularization functionals with existing discrete or continuous solvers. 


1 Introduction and Motivation 

The terms clustering and segmentation are largely synonyms. The 
latter is common specifically for computer vision when data points 
are intensities or other features Ip e IZ^ sampled at regularly 
spaced image pixels p e IZ^. The pixel’s location is essential 
information. Many image segmentation methods treat Ip as a 
function / : IZ^ IZ^ and process intensity and location in 
fundamentally different ways. This applies to many regularization 
methods including discrete MRF-based techniques [1] and con¬ 
tinuous variational methods [2]. For example, such methods often 
use pixel locations to represent the segment’s geometry/shape and 
intensities to represent its appearance [3], [4], [5], [ 6 ]. 

The term clustering is often used in the general context where 
data point Ip is an arbitrary observation indexed by integer p. 
Data clustering techniques [7] are widely used outside of vision. 
Variants of K-means, spectral or other clustering methods are also 
used for image segmentation. They often join the pixel’s location 
and its feature into one combined data point in IZ^^^. We focus 
on a well-known general group of pairwise clustering methods [ 8 ] 
based on some estimated affinities between all pairs of points. 

While independently developed from different methodolo¬ 
gies, standard regularization and pairwise clustering methods are 
defined by objective functions that have many complementary 


properties reviewed later. Our goal is to combine these functions 
into a joint energy applicable to image segmentation or general 
clustering problems. Such objectives could not be combined be¬ 
fore due to significant differences in the underlying optimization 
methods, e.g. combinatorial graph cut versus spectral relaxation. 
While focused on MRF regularization, our approach to integrating 
pairwise clustering is based on a bound formulation that easily 
extends to any regularization functional with existing solvers. 

We use basic notation applicable to either image segmentation 
or general data clustering. Let (7 be a set of pixels, voxels, or any 
other points p. For example, for 2D images Vi could be a subset 
of regularly spaced points in IZ ^. Set Q could also represent data 
points indices. Assume that every p e Q comes with an observed 
feature Ip e IZ^. For example. Ip could be a greyscale intensity 
in IZ^, an RGB color in IZ^, or any other high-dimensional 
observation. If needed. Ip could also include the pixel’s location. 

A segmentation of O can be equivalently represented either as 
a labeling S := {Sp\p e Q) including integer node labels 1 < Sp < 
iT or as a partitioning {S^} of set O into K non-overlapping 
subsets or segments '.= {p e = k}. With minor abuse of 

notation, we will also use as indicator vectors in { 0 , 

We combine standard pairwise clustering criteria such as Av¬ 
erage Association (AA) or Normalized Cut (NC) [ 8 ] and common 
regularization functionals such as pairwise or high-order MRF 
potentials [1], [9]. The general form of our joint energy is 

E{S) = Ea{S) + 7 E Ec{Sc) (1) 

ceT 

where the first term is some pairwise clustering objective based on 
data affinity matrix or kernel A := \^Apq\ with Apq := A(Ip^Iq) 
defined by some similarity function ^(•,-). The second term in 
(1) is a general formulation of MRF potentials [10], [11], [12]. 
Constant 7 is a relative weight of this term. Subset c c 
represents a factor typically consisting of pixels with nearby 
locations. Factor labels Sc ’= {Sp\p e c) is a restriction of labeling 
S to c. Potentials Ec(Sc) for a given set of factors T represent 
various forms of unary, second, or higher order constraints, where 
factor size |c| defines the order. Factor features {Ip\p e c} often 
work as parameters for potential Ec. Section 1.1 reviews standard 
MRF potentials. Note that standard clustering methods encourage 
balanced segments using ratio-based objectives. They correspond 
to high-order potential Ea(S) of order |r^| in (1). Sections 1.2 
and 1.3 review standard clustering objectives used in our paper. 
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1.1 Overview of MRF regularization 

Probably the most basic MRF regularization potential corresponds 
to the second-order Potts model [10] used for edge alignment 

^ Ec{Sc) = Z [Sp * ^9] « \\dS\\ (2) 

pq€j\f 

where a set of pairwise factors = Af includes edges c = {pq} 
between pairs of neighboring nodes and [•] are Iverson brackets. 
Weight Wpq is a discontinuity penalty between p and q. It could 
be a constant or may be set by a decreasing function of intensity 
difference Ip - Iq attracting the segmentation boundary to image 
contrast edges [4]. This is similar to the image-based boundary 
length in geodesic contours [3], [5]. 

A useful bin consistency constraint enforced by the -Potts 
model [11] is defined over an arbitrary collection of high-order 
factors T. Factors c g P correspond to predefined subsets of 
nodes such as superpixels [11] or bins of pixels with the same 
color/feature [13], [14]. The model penalizes inconsistency in 
segmentation of each factor 

ZEc{Se) = ^min{r,|c|-|5cr} (3) 


combining discrete variables S = with continuous vari¬ 
ables m = representing cluster “centers”. Norm ||.|| 

denotes the Euclidean metric. For any given S the optimal centers 
arg min^ F(S,m) are the means 




^qeSk Iq 




(7) 


where \S^\ denotes the segment’s cardinality. Assuming current 
segments the update operation giving argmin^ F(S^ pst) 


(basic KM \ o ■ wt ii 

i procedure] ^ argmm ||/p - II (8) 

defines the next solution St+i as per standard K-means algorithm. 
This greedy descent technique converges only to a local minimum 
of KM objective (6), which is known to be NP hard to optimize. 
There are also other approximation methods. Below we review the 
properties of KM objective (6) independently of optimization. 

The optimal centers rrik in (7) allow to represent (6) via an 
equivalent objective of a single argument S 


E E = E 1-5^1 (9) 

k peSk k 


where T is some threshold and \Sc\* '= max/. \S^ n c\ is the 
cardinality of the largest segment inside c. Potential (3) has its 
lowest value (zero) when all nodes in each factor are within the 
same segment. 

Standard label cost [12] is a sparsity potential defined for a 
single high-order factor c = ^2. In its simplest form this potential 
penalizes the number of distinct segments (labels) in S 

EniS) = E^fc-[l^"l>0] (4) 

k 

where could be a constant or a cost for each specific label. 

Potentials (2), (3), (4) are only a few examples of regular¬ 
ization terms widely used in combination with powerful discrete 
solvers like graph cut [10], belief propagation [15], TRWS [16], 
LP relaxation [17], [18], or continuous methods [19], [20], [21]. 

Image segmentation methods often combine regularization 
with a likelihood term integrating segments/objects color models. 
For example, [4], [22] used graph cuts to combine second-order 
edge alignment (2) with a unary (first-order) appearance term 

-EEiog^’^^) (5) 

k peSk 

where {P^} are given probability distributions. Unary terms like 
(5) are easy to integrate into any of the solvers above. 

If unknown, parameters of the models {P^ } in a regularization 
energy including (5) are often estimated by iteratively minimizing 
the energy with respect to S and model parameters [23], [24], 
[25], [26], [12]. In presence of variable model parameters, (5) can 
be seen as a maximum likelihood (ML) model-fitting term or a 
probabilistic K-means clustering objective [27]. The next section 
reviews K-means and other standard clustering methods. 

1.2 Overview of K-means and clustering 

Many clustering methods are based on K-means (KM). The most 
basic iterative KM algorithm [28] can be described as the block- 
coordinate descent for the following mixed objective 

F{S,m) := E E Wlp-rukf (6) 

k peSk 


The sum of squared distances between data points {Ip\p e S^} and 
mean pgk normalized by \S^\ gives the sample variance denoted 
by var(S^). Formulation (9) presents the basic KM objective 
as a standard variance criterion for clustering. That is, K-means 
attempts to find K compact clusters with small variance. 

K-means can also be presented as a pairwise clustering criteria 
with Euclidean affinities. The sample variance can be expressed as 
the sum of distances between all pairs of the points. For example, 
plugging (7) into (9) reduces this KM objective to 


T^pqeSk \\Ip Iq\\ 


( 10 ) 


Taking the square in the denominator transforms (10) to another 
equivalent KM energy with Euclidean dot-product affinities 


c 


^ ^pqeSk {Ip^ Iq) 


( 11 ) 


Note that we use = and S for “up to additive constant” relations. 

Alternatively, K-means clustering can be seen as Gaussian 
model fitting. Eormula (5) for normal distributions with variable 
means m/. and some fixed variance 

^ log7V(/pK) (12) 

k peSk 

equals objective (6) up to a constant. 

Various extensions of objectives (6), (9), (10), (11), or (12) 
lead to many powerful clustering methods such as kernel K-means, 
average association, and Normalized Cut, see Tab.l and Eig.34. 


1.2.1 Probabilistic K-means ("pKM ) and model fitting 
One way to generalize K-means is to replace squared Euclidean 
distance in (6) by other distortion measures WWd leading to a 
general distortion energy commonly used for clustering 

E E ¥p-mk\\d- (13) 

k peSk 

The optimal value of parameter ruk may no longer correspond to 
a mean. Eor example, the optimal m/. for non-squared L 2 metric 
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(c) basic K-means 

(d) elliptic K-means 

E = 2427 ■ _ ._• 

E = 2378 

(e) GMM: local min 

(f) GMM: from gr. truth 

E = 2633 

■ 

■■ 

E = 1295 

■ ; 

■■ 

(g) K-modes ~ mean-shift 

(h) kernel K-means 


Fig. 1; Model fitting fpKMj (14) kernel K-means (kKM) 
(22). Histogram fitting converges in one step assigning initially 
dominant bin label (a) to all points in the bin (b): energy (14,15) is 
minimal at any volume-balanced solution with one label inside 
each bin [27]. Basic and elliptic K-means (one mode GMM) 
under-fit the data (c,d). Six mode GMMs over-fit (e) as in (b). 
GMMs have local minima issues; ground-truth initialization (f) 
yields lower energy (14,15). Kernel K-means (21,22) with Gaus¬ 
sian kernel k in (h) outperforms pKM with distortion \\\\f^ in (g) 
related to K-modes or mean-shift (weak /cKM, see Sec. 1.2. 3). 


is a geometric median. For exponential distortions the optimal rrik 
may correspond to modes [29], [30], see Appendix B. 

A seemingly different way to generalize K-means is to treat 
both means and covariance matrices for the normal distributions 
in (12) as variables. This corresponds to the standard elliptic K- 
means [31], [32], [12]. In this case variable model parameters 
and data points Ip are not in the same space. 
Yet, it is still possible to present elliptic K-means as distortion 
clustering (13) with “distortion” between Ip and defined by an 
operator || — ||^ corresponding to a likelihood function 

\\Ip - ^kWd ■= - '^OgN{Ip\dk)- 



(a) Input and initialization 




(b) GMM fitting in RGB (GrabCut without edges) 


1 #' 


(c) Normalized Cut in RGB 


Fig. 2: Without edge alignment (2) iterative GMM fitting [26] 
shows stronger data over-fit compared to pairwise clustering [8]. 

clustering (13) generalizes to ML model fitting objective 

E E 114- 4lU = -E E ^ogp{ip\0k) (14) 

k p€:S^ k p€:S^ 

which is (5) with explicit model parameters 0^- This formulation 
suggests probabilistic K-means^ (pKM) as a good idiomatic name 
for ML model fitting or distortion clustering (13), even though the 
corresponding parameters Oj^ are not “means”, in general. 

Probabilistic K-means (14) is used in image segmenta¬ 
tion with models such as elliptic Gaussians [31], [32], [12], 
gamma/exponential [25], or other generative models [33]. Zhu- 
Yuille [23] and GrabCut [26] use pKM with highly descriptive 
probability models such as GMM or histograms. Information 
theoretic analysis in [27] shows that in this case pKM objective 
(14) reduces to the standard entropy criterion for clustering 

(15) 

k 

where H(S^) is the entropy of the distribution for {Ip\p e S^}. 

Intuitively, minimization of the entropy criterion (15) favors 
clusters with tight or “peaked” distributions. This criterion is 
widely used in categorical clustering [34] and decision trees [35], 
[36] where the entropy evaluates histograms over “naturally” dis¬ 
crete features. However, the entropy criterion with either discrete 
histograms or continuous GMM densities has limitations in the 
context of continuous feature spaces, see Appendix C. Iterative 
fitting of descriptive models is highly sensitive to local minima 
[14], [37] and easily over-fits even low dimentional features in 
(Fig.lb,e) or in (RGB colors. Fig. 2b). This may explain 
why this approach to clustering is not too common in the learning 


. 1- 11-T name probabilistic K-means in the general clustering context was 

Similar distortion measures can be defined for arbitrary probability coined by [ 27 ]. They formulated ( 14 ) after representing distortion energy ( 13 ) 
distributions with any variable parameters 0^. Then, distortion as ML fitting of Gibbs models ^e"llII^ for arbitrary integrable metrics. 
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community. As proposed in (1), instead of entropy criterion we 
will combine MRF regularization with general pairwise clustering 
objectives Ea widely used for balanced partitioning of arbitrary 
high-dimensional features [8]. 


1.2.2 Kernel K-means (kKM) and pairwise clustering 

This section reviews pairwise extensions of K-means (11) such as 
kernel K-means (/cKM) and related clustering criteria. In machine 
learning, /cKM is a well established data clustering technique [42], 
[43], [44], [40], [45], [46] that can identify non-linearly separable 
structures. In contrast to pKM based on complex models, kKM 
corresponds to complex (nonlinear) mappings 


embedding data {Ip\p € f2} c TZ^ as points (pp = ^(/p) in 
a higher-dimensional Hilbert space l-L. The original non-linear 
problem can often be solved by simple linear separators of the 
embedded points {(ppip e Q} c H. Kernel K-means corresponds 
to the basic K-means (6) in the embedding space 

F{S,m) = Y, Y. Up-mkf. (16) 

k peS^ 


Optimal segment centers rrik corresponding to the means 


PSk - 


15^1 


(17) 


reduce (16) to fcKM energy of the single variable S similar to (9) 


F{S) = YY\\4>P-^^s4‘'■ ( 18 ) 

k peSk 


Similarly to (10) and (11) one can write pairwise clustering 
criteria equivalent to (18) based on Euclidean distances ||0(/p) - 
(/)(/g)|| or inner products (0(1^), 0(1^)), which are commonly 
represented via kernel function k(x^y) 


k{x,y) := {(j){x),(j)iy))- ( 19 ) 


The (non-linear) kernel function k(x^y) corresponds to the inner 
product in H. It also defines Hilbertian metric^ 

\\x-y\l ■■= \\4>{x) - 4>{y)f 

= k{x,x) + k{y,y)-2k{x,y) (20) 

isometric to the Euclidean metric in the embedding space. Then, 
pairwise formulations (10) and (11) for K-means in the embedding 
space (18) can be written with respect to the original data points 
using isometric kernel distance || ||^ in (20) 

Z?/C\ - ^pq^Sk \\Ip - Iq\\k 

= ?— w\ — ^ ^ 

or using kernel function /c in (19) 

F(S) ^ 

The definition of kernel k in (19) requires embedding 
Since pairwise objectives (21) and (22) are defined for any kernel 
function in the original data space, it is possible to formulate kKM 
by directly specifying an affinity function or kernel k{x^y) rather 


2. Such metrics can be isometrically embedded into a Hilbert space [47]. 


than embedding (j){x). This is typical for kKM explaining why the 
method is called kernel K-means rather than embedding K-means ^. 

Given embedding kernel function k defined by (19) is 
positive semi-definite (p.s.d), that is k{x^y) > 0 for any x^y. 
Moreover, Mercer's theorem [50] states that p.s.d. condition for 
any given kernel k(x,y) is sufficient to guarantee that k(x^y) 
is an inner product in some Hilbert space. That is, it guarantees 
existence of some embedding fix) such that (19) is satisfied. 
Therefore, kKM objectives (18), (21), (22) are equivalently defined 
either by embeddings or p.s.d. kernels k. Thus, kernels are 
commonly assumed p.s.d. However, as discussed later, pairwise 
clustering objective (22) is also used with non p.s.d. affinities. 

To optimize kKM objectives (18), (21), (22) one can use the 
basic KM procedure (8) iteratively minimizing mixed objective 
(16) explicitly using embedding 

( 23 ) 

where is the mean (17) for current segment S^. Equivalently, 
this procedure can use kernel k instead of (j). Indeed, as in 
Section 8.2.2 of [51], the square of the objective in (23) is 


P^pf^- 2(j)p pgk + ll/i^fc 11^ 


l^t"! \s^\^ 


where we use segment as an indicator vector, embedding (j) 
as an embedding matrix (j) := [0p] where points fip = fiilp) are 
columns, and ' denotes the transpose. Since the crossed term is a 
constant at p, the right hand side gives an equivalent objective for 
computing Sp in (23). Using kernel matrix 1C ’.= and indicator 
vector Ip for element p we get 


implicit \ 
kKM Sp 
procedure/ 


arg min 
k 


ok' 1 ^ nk 1 f ok 
KOf ^ Lp KOf 


(24) 


where the kernel matrix is directly determined by kernel k 


ICpq - fipfiq - {(j)p^(j)q) - k(Ip^Iq). 

Approach (24) has quadratic complexity iterations. But, it 

avoids explicit high-dimensional embeddings fip in (23) replacing 
them by kernel k in all computations, a.k.a. the kernel trick. 

Note that the implicit kKM procedure (24) is guaranteed to 
decrease pairwise kKM objectives (21) or (22) only for p.s.d. 
kernels. Indeed, equation (24) is derived from the standard greedy 
K-means procedure in the embedding space (23) assuming kernel 
(19). The backward reduction of (24) to (23) can be done only for 
p.s.d. kernels k when Mercer’s theorem guarantees existence of 
some embedding such that k(Ip^ Iq) = 

Pairwise energy (21) helps to explain the positive result 

—(I —I 

for kKM with common Gaussian kernel k = exp ^ 

Eig.l(h). Gaussian kernel distance (red plot below) 

114 - 41^ ocl - kilp, 4) = 1 - exp (25) 

is a “robust” version of Euclidean metric 
(green) in basic K-means (10). Thus, 

Gaussian kKM finds clusters with small 
local variances, Eig.l(h). In contrast, ba¬ 
sic K-means (c) tries to find good clus¬ 
ters with small global variances, which 
is impossible for non-compact clusters. 







3. This could be a name for some clustering techniques constructing explicit 
embeddings [48], [49] instead of working with pairwise affinities/kernels. 
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A. basic K-means (KM) 

(^.g. [28]) 

Ffc ^peS^ W^P 1 

Variance criterion 

y- '^pqeSk IIU“LII^ 

Efc |S'=|.™r(S''=) 




more complex more complex 

probability models data representation 


B. probabilistic K-means (pKM) 

C. kernel K-means (fcKM) 

equivalent energy formulations: 

equivalent energy formulations: 

Y.kT.pesk\\lp-^k\\d = -Y.kT.pesk^^'P{lp\^k) 

V-' V' ^ l|2 _ V' ^pqeSk W^P 

Efc EpeS'fe \\4^Dp) II ~ Efc 2|S'^| 

£ Y' '^pqeSk ^(^pAq) 

2^k l^fcj 

related examples: 

related examples: 

elliptic K-means [31], [32] 

Average Association or Distortion [38] 

geometric model fitting [12] 

Average Cut [8] 

K-modes [29] or mean-shift [39] (weak fcKM) 

Normalized Cut [8], [40] (weighted A^KM) 

Entropy criterion Yk [23], [26] 

for highly descriptive models (GMMs, histograms) 

Gini criterion Y.k |5'^| • G(S'^) [35], [41] 

for small-width normalized kernels (see Sec. 5.1) 


TABLE 1: K-means and related clustering criteria: Basic K-means (A) minimizes clusters variances. It works as Gaussian model fitting. 
Fitting more complex models like elliptic Gaussians [31], [32], [12], exponential distributions [25], GMM or histograms [23], [26] 
corresponds to probabilistic K-means [27] in (B). Pairwise clustering via kernel K-means (C) using more complex data representation. 


Average association (AA) or distortion (AD): Equivalent 
pairwise objectives (21) and (22) suggest natural extensions of 
kKM. For example, one can replace Hilbertian metric ||||^ in 
(21) by an arbitrary zero-diagonal distortion matrix D = [Dpq] 
generating average distortion (AD) energy 

E^S) 2 ( 26 ) 

reducing to kKM energy (21) for Dpq = \\Ip - Iq\\‘l. Similarly, 
p.s.d. kernel k in (22) can be replaced by an arbitrary pairwise 
similarity or affinity matrix A = [Apq] defining standard average 
association (AA) energy 

Eaa{S) (27) 


reducing to /cKM objective (22) for Apq = k(Ip^ Iq). We will also 
use association between any two segments S'^ and 


assoc(S\ S^) := ^ Apq 

,q^S3 

allowing to rewrite AA energy (27) as 


5* AS^ 


^ _ ^associS\S'^) _ ^S^'AS^ 

Eaa{b) = ^ ^ 


(28) 


(29) 


The matrix expressions in (28) and (29) represent segments as 
indicator vectors such that 5^ = 1 iff Sp = k and symbol' means 
a transpose. Matrix notation as in (29) will be used for all pairwise 
clustering objectives discussed in this paper. 

kKM algorithm (24) is not guaranteed to decrease (27) for 
improper (non p.s.d.) kernel matrix JC = A, but general AA and 
AD energies could be useful despite optimization issues. However, 
[38] showed that dropping metric and proper kernel assumptions 
are not essential; there exist p.s.d. kernels with kKM energies 


equivalent (up to constant) to AD (26) and A A (27) for arbitrary 
associations A and zero-diagonal distortions D, see Fig. 3. 

For example, for any given affinity A in (27) the diagonal shift 
trick of Roth et al. [38] generates the “kernel matrix” 

A A' 

(30) 

For sufficiently large scalar S matrix /C is positive definite yielding 
a proper discrete kernel k(Ip^Iq) = JCpq 

kiIp,Iq)-X^X^E- 

for finite set x = {Ip\p ^ It is easy to check that kKM energy 
(22) with kernel k = JCin (30) is equivalent to AA energy (27) with 
affinity A, up to a constant. Indeed, for any indicator X e {0,1}'^' 
we have X'X = I'X implying 

X'lCX X'AX X'A'X X'X X'AX 

I'X ■ 2(1'X) 2(1'X) I'X ■ I'X ■ 

Also, Section 3.1 uses eigen decomposition of /C to construct 
an explicit finite-dimensional Euclidean embedding"^ fp e 71^^^ 
satisfying isometry (20) for any p.d. discrete kernel k = JC. 
Minimizing kKM energy (18) over such embedding isometric to 
JC in (30) is equivalent to optimizing (22) and, therefore, (27). 

Since average distortion energy (26) for arbitrary D is equiva¬ 
lent to average association for A = - y, it can also be converted to 
kKM with a proper kernel [38]. Using the corresponding kernel 
matrix (30) and (20) it is easy to derive Hilbertian distortion 
(metric) equivalent to original distortions D 

\\Ip-Iq\\l:=^f^ + 2S{l-l'-I). (31) 

4. Mercer’s theorem is a similar eigen decomposition for continuous p.d. 
kernels k(x, y) giving infinite-dimensional Hilbert embedding (/)((r). Discrete 
kernel embedding (l)p = in Sec. 3.1 (56) has finite dimension |f2|, which 

is still much higher than the dimension of points Ip, e.g. iD for colors. Sec. 3.1 
also shows lower dimensional embeddings ^p approximating isometry (20). 
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average 

association 


-E 




const (S) 

+ ^ App 

pen 


pair-wise similarities Ap^ 

symmetry or p.d. are unnecessary 


NOTE : c.p.d. A will give 
non-negative distortions Dpg 


kernel 
K-means 

EE \\(l)p - flsk 

k pes^ 

embedding 0 in Hilbert space 

_ ^ ^pqeS'‘ W^p ~ 

= ^ 215^1 

EpqgS*= (*^P’ *^ 9 ) 

fc |5''| 

correspond to Hilbertian metric and p.d. kernel 
in the original data space 



pair-wise distortions Dp^ 

assuming zero diagonal Dpp = 0 
but symmetry and non-negativity are not needed 
[Roth, PAMI’03] 


Fig. 3: Equivalence of pairwise clustering methods: kernel K-means (/cKM), average distortion (AD), average association (AA) based 
on Roth et al. [38], see (30), (31). Equivalence of these methods in the general weighted case is discussed in Appendix A (Eig. 33). 


Eor simplicity and without loss of generality, the rest of the 
paper assumes symmetric affinities A = A' since non-symmetric 
ones can be equivalently replaced by ^ . However, we do not 

assume positive definiteness and discuss diagonal shifts, if needed. 

Weighted kKM and weighted AA: Weighted K-means [28] is 
a common extension of KM techniques incorporating some given 
point weights w = {wp\p e Cl}. In the, context of embedded points 
<pp it corresponds to weighted kKM iteratively minimizing the 
weighted version of the mixed objective in (16) 

F"’(S',m) ■■= Y, 11 u)p\\(pp-mkf. (32) 

k peS^ 


Optimal segment centers rrik are now weighted means 

Wq<i)q ^ (j,W 
Wq ~ W'S^ 


(33) 


where the matrix formulation has weights represented by column 
vector w e and diagonal matrix W := diag{w). Assuming 
a finite dimensional data embedding (pp e this formulation 
uses embedding matrix p ’.= [0^] with column vectors pp. This 
notation implies two simple identities used in (33) 

Wq = w'S^ and ^ Wqpp = pW. (34) 

qeS^ qeS^ 


Inserting weighted means (33) into mixed objective (32) produces 
a pairwise energy formulation for weighted kKM similar to (22) 


F^(S) 


E E wp\\(pp- 

k peS^ 


-E 

k 

-E 


'ZpqeS^ ^p^q^pq 

S^'WJCWS^ 


(35) 

(36) 


where p.s.d kernel matrix JC = p'p corresponds to the dot products 
in the embedding space, i.e. JCpq = pppq. 

Replacing the p.s.d. kernel with an arbitrary affinity matrix A 
defines a weighted AA objective generalizing (27) and (29) 


EZ{S) 


-E 


S’^'WAWS’^ 

w'S^ 


(37) 


Weighted AD can also be defined. Equivalence of /cKM, AA, and 
AD in the general weighted case is discussed in Appendix A. 

Other pairwise clusteing criteria: Besides AA there are 
many other standard pairwise clustering criteria defined by affinity 
matrices A = [Apg]. Eor example. Average Cut (AC) 


Eac{S) 


^assoc{S’^,S’^) 

S’^\D-A)S'^ 




(38) 


where D := diag{d) is a degree matrix defined by node degrees 
vector d := Al. The formulation on the last line (38) comes from 
the following identity valid for Boolean X e {0,1}'^' 

X'DX = X'd. 


Normalized Cut (NC) [8] in (39) is another well-known 
pairwise clustering criterion. Both AC and NC can also be reduced 
to kKM [38], [52], [40]. However, spectral relaxation [8] is more 
common optimization method for pairwise clustering objectives 
than iterative kKM procedure (24). Due to popularity of NC and 
spectral relaxation we review them in a dedicated Section 1.3. 


w'S^ 


1.2.3 Pairwise vs. pointwise distortions 

Equivalence of kKM to pairwise distortion criteria (26) helps to 

juxtapose kernel K-means with probabilistic K-means (Sec. 1.2.1) 

















7 


from one more point of view. Both methods generalize the basic 
K-means (6), (10) by replacing the Euclidean metric with a more 
general distortion measure || While pKM uses “pointwise” for¬ 
mulation (13) where || \\d measures distortion between a point and 
a model, /cKM uses “pairwise” formulation (21) where || = || ||^ 

measures distortion between pairs of points. 

These two different formulations are equivalent for Euclidean 
distortion {i.e. basic K-means), but the pairwise approach is strictly 
stronger than the pointwise version using the same Hilbertian 
distortion ||||^ = ||||^ in non-Euclidean cases (see Appendix B). 
The corresponding pointwise approach is often called weak kernel 
K-means. Interestingly, weak /cKM with standard Gaussian kernel 
can be seen as K-modes [29], see Eig. 1(g). Appendix B also de¬ 
tails relations between K-modes and popular mean-shift clustering 
[39]. An extended version of Table 1 including weighted KM and 
weak kKM techniques is given in Eigure 34. 


The weighted version of /cKM procedure (24) detailed in Appendix 
A minimizes weighted AA (37) only for p.s.d. affinities, but 
positive definiteness of A is not critical. Eor example, an extension 
of the diagonal shift (30) [38] can convert NC (41) with arbitrary 
(symmetric) A to an equivalent NC objective with p.s.d. affinity 

}C = A + 6-D (43) 

using degree matrix D := diag(d) = W and sufficiently large 6. 
Indeed, for indicators X e {0,1}'^' we have X'DX = d'X and 

X'JCX X'AX .X'DX X'AX 
d'X ~ d'X d'X ~ d'X ^ ' 

Positive definite JC (43) implies p.d. affinity (42) of weighted AA 

IC = D~^K.D^^ = D~^AD~^ + 6D~^. (44) 

The weighted version of kKM procedure (24) for this p.d. kernel 
[58] greedily optimizes NC objective (41) for any (symmetric) A. 


1.3 Overview of Normalized Cut and spectral clustering 

Section 1.2.2 has already discussed /cKM and many related pair¬ 
wise clustering criteria based on specified affinities A = 

This section is focused on a related pairwise clustering method. 
Normalized Cut (NC) [8]. Shi and Malik [8] also popularized 
pairwise clustering optimization via spectral relaxation, which is 
different from iterative K-means algorithms (23) (24). Note that 
there are many other popular optimization methods for different 
clustering energies using pairwise affinities [53], [54], [55], [56], 
[57], which are outside the scope of this work. 


1.3.1 NC objective and its reiation to AA and /cKM 
Normalized Cut (NC) energy [8] can be written as 


Enc(S)-.= -Z 

k 


assoc(S’^,S’^) 
assoc{Q, S^) 


- -E 


k 


S^'as’^ 

I'ASf' 


(39) 


where association (28) is defined by a given affinity matrix A. 
The second matrix formulation above shows that the difference 
between NC and A A (29) is in the normalization. A A objective 
uses denominator V= |5'^|, which is k-th segment’s size. NC 
(39) normalizes by weighted size. Indeed, using d:= A'1 

I'AS'^ = d'S’^ = Y. dp 


where weights d= {dp\p e are node degrees 


Xl ^PQ' 


For convenience, NC objective can be formatted similarly to (29) 


Enc{S) 


-Y 


Sk'AS'^ 

d'Sk 


(41) 


Eor some common types of affinities where dp const, 
e.g. iT-nearest neighbor (KNN) graphs, NC and A A objectives 
(41) and (29) are equivalent. More generally, Bach & Jordan [52], 
Dhillon et al. [40] showed that NC objective can be reduced to 
weighted A A or kKM with specific weights and affinities. 

Our matrix notation makes equivalence between NC (41) and 
weighted A A (37) straightforward. Indeed, objective (41) with 
affinity A coincides with (37) for weights w and affinity A 


1.3.2 Spectrai relaxation and other optimization methods 
There are many methods for approximate optimization of NP-hard 
pairwise clustering energies besides greedy K-mean procedures. In 
particular, Shi, Malik, and Yu [8], [59] popularized spectral relax¬ 
ation methods in the context of normalized cuts. Such methods 
also apply to AA and other problems [8]. Eor example, similarly 
to [59] one can rewrite AA energy (27) as 

Eaa{S)^-ir{Z'AZ) for Z := 

where Z is a |^2| x iT matrix of normalized indicator vectors . 
Orthogonality (S^yS^ = 0 implies Z'Z = Ik where Ik is an 
identity matrix of size K x K. Minimization of the trace energy 
above with relaxed Z constrained to a “unit sphere” Z'Z = is a 
simple representative example of spectral relaxation in the context 
of AA. This relaxed trace optimization is a generalization of 
Rayleigh quotient problem that has an exact closed form solution 
in terms of K largest eigenvectors for matrix A. This approach 
extends to general multi-label weighted AA and related graph 
clustering problems, e.g. AC and NC [8], [59]. The main computa¬ 
tional difficulties for spectral relaxation methods are explicit eigen 
decomposition for large matrices and integrality gap - there is a 
final heuristics-based discretization step for extracting an integer 
solution for the original combinatorial problem from an optimal 
relaxed solution. Eor example, one basic discretization heuristic is 
to run K-means over the row-vectors of the optimal relaxed Z. 

Other optimization techniques are also used for pairwise 
clustering. Buhmann et al. [60], [38] address the general AD 
and AA energies via mean field approximation and deterministic 
annealing. It can be seen as a fuzzy version^ of the kernel K- 
means algorithm. In the context of normalized cuts Dhillon et al. 
[40], [58] use spectral relaxation to initialize /cKM algorithm. 

In computer vision it is common to combine various con¬ 
straints or energy terms into one objective function. Similar efforts 
were made in the context of pairwise clustering as well. Eor exam¬ 
ple, to combine /cKM or NC objectives with Potts regularization 
[62] normalizes the corresponding pairwise constraints by cluster 
sizes. This alters the Potts model to fit the problem to a standard 
trace-based formulation. 

Adding non-homogeneous linear constraints into spectral re¬ 
laxation techniques also requires approximations [63] or model 



w = d = A'l and A = W~^AW~\ 


(42) 


5. Fuzzy version of K-means in Duda et al. [61] generalizes to A^KM. 












modifications [64]. Exact optimization for the relaxed quadratic 
ratios (including NC) with arbitrary linear equality constraints is 
possible by solving a sequence of spectral problems [65]. 

Our bound optimization approach allows to combine many 
standard pairwise clustering objectives with any regularization 
terms with existing solvers. In our framework such pairwise 
clustering objectives are interpreted as high-order energy terms 
approximated via linear upper bounds during optimization. 

1.4 Our approach summary 

We propose to combine standard pairwise clustering criteria such 
as AA, AC, or NC with standard regularization constraints such as 
geometric or MRF priors. To achieve this goal, we propose unary 
(linear) bounds for the pairwise clustering objectives that are easy 
to integrate into any standard regularization algorithms. Below we 
summarize our motivation and our main technical contributions. 

1.4.1 Motivation and Reiated work 

Due to significant differences in existing optimization methods, 
pairwise clustering (e.g. NC) and Markov Random Fields (MRF) 
techniques are used separately in many applications of vision and 
learning. They have complementary strengths and weaknesses. 

For example, NC can find a balanced partitioning of data 
points from pairwise affinities for high-dimensional features [ 8 ], 
[ 66 ], [67]. In contrast, discrete MRF as well as continuous reg¬ 
ularization methods commonly use probabilistic K-means [27] or 
model fitting to partition image features [24], [23], [26], [12]. Clus¬ 
tering data by fitting simple parametric models seems intuitively 
justified when data supports such simple models, e.g. Gaussians 
[24] or lines/planes [12]. But, clustering arbitrary data by fitting 
complex models like GMM or histograms [23], [26] seems like 
an idea bound to over-fit the data. Indeed, over-fitting happens 
even for low dimensional color features [14], see also Fig.l(b,e) 
and Fig.2(b). Our energy (1) allows a general MRF framework to 
benefit from standard pairwise clustering criteria, e.g. NC, widely 
used for complex data. In general, kernel-based clustering methods 
are a prevalent choice in the learning community as model fitting 
{e.g. EM) becomes intractable for high dimensions. We show 
potent segmentation results for basic formulations of energy ( 1 ) 
with higher-dimensional features like RGBXY, RGBD, RGBM 
where standard MRF methods using model-fitting fail. 

On the other hand, standard NC applications can also benefit 
from additional constraints [63], [65], [ 68 ]. We show how to add a 
wide class of standard MRF potentials. For example, standard NC 
segmentation has weak alignment to contrast edges [67]. While 
this can be addressed by post-processing, inclusion of the standard 
pair-wise Potts term [10], [4] offers a principled solution. We 
show benefits from combining NC with lower and higher-order 
constraints, such as sparsity or label costs [12]. In the context of a 
general graph clustering, higher-order consistency terms based on 
a -Potts model [11] also give significant improvements. 

The synergy of the joint energy (1) is illustrated by juxtaposing 
the use of the pixel location information (XY) in standard NC 
and MRF techniques. The basic pairwise Potts model typically 
works on the nearest-neighbor pixel grids A /4 or A/g where XY 
information helps contrast/edge alignment and enforces “smooth” 
segment boundaries. Wider connectivity Potts leads to denser 
graphs with slower optimization and poorer edge localization. In 
contrast, common NC methods [ 8 ] augment pixel color features 
with location using relatively wide bandwidth for the XY dimen¬ 
sion. This encourages segments with spatially “compact” regions. 


Narrower XY kernels improve edge alignment [67], but weaken 
regional color/feature consistency. On the other hand, very large 
XY kernels produce color-only clustering with spatially incoher¬ 
ent segments. Combining regional color consistency with spatial 
coherence in a single NC energy requires a compromise when 
selecting XY bandwidth. Our general energy (1) can separate the 
regional consistency {e.g. NC clustering term) from the boundary 
smoothness or edge alignment {e.g. Potts potential). Interestingly, 
it may still be useful to augment colors with XY in the NC term 
in (1) since XY kernel bandwidth a allows to separate similar- 
appearance objects at distances larger than a, see Sec. 6 .2.3. 

Adding high-order term Ea to MRF potentials in (1) differs 
from fully-connected CRF [69]. Fike many MRF/CRF models 
the method in [69] lacks balancing. It is a denser version of the 
pairwise Potts model giving a trivial solution without a data term. 
In contrast, ratio-based objectives Ea are designed for balanced 
clustering. In fact, our unary bounds for such objectives allow to 
combine them with fully connected CRF or any other MRF/CRF 
model with known solvers, e.g. mean field approximation [69]. 

Some earlier work also motivates a combination of pairwise 
clustering {e.g. NC) with MRF potentials like the Potts model 
[62]. They alter the Potts model to fit it to the standard trace-based 
formulation of NC. Our general bound approach can combine 
many pairwise clustering methods with any solvable discrete or 
continuous regularization potentials. 

1.4.2 Main contributions 

We propose a new energy model (1) for multi-label image seg¬ 
mentation and clustering, efficient bound optimization algorithms, 
and demonstrate many useful applications. Our preliminary results 
appear in [41] and [70]. Our main contributions are: 

• We propose a general multi-label segmentation or clus¬ 
tering energy ( 1 ) combining pairwise clustering {e.g. NC) 
with standard second or higher-order MRF regularization 
potentials. A pairwise clustering term can enforce balanced 
partitioning of observed image features and MRF terms 
can enforce many standard regularization constraints. 

• We obtain kernel (exact) and spectral (approximate) 
bounds for common pairwise clustering criteria providing 
two auxiliary functions for joint energy (1). In the context 
of standard MRF potentials {e.g. Potts, robust -Potts, 
label cost) we propose move-making algorithms for energy 
( 1 ) generalizing ^-expansion and aP-swap moves^. 

• For our spectral bouncf we derive a new low-dimensional 
data embedding fip minimizing isometry errors w.r.t. 
affinities. These optimal embeddings complete the gap 
between standard KM discretization heuristics in spectral 
relaxation [ 8 ] and the exact kKM formulations [52], [40]. 

• Our experiments demonstrate that typical NC applications 
benefit from extra MRF constraints, as well as, MRF seg¬ 
mentation benefit from the high-order NC term encourag¬ 
ing balanced partitioning of image features. In particular, 
our NC-i-MRF framework works for higher-dimensional 
features {e.g. RGBXY, RGBD, RGBM) where standard 
model-fitting clustering [23], [26], [12] fails. 

6. Our bounds for pairwise criteria can be also integrated into auxiliary 
functions with other standard regularization potentials (truncated, cardinality, 
TV) addressed by discrete {e.g. message passing, relaxations, mean-field 
approximations) or continuous {e.g. convex, primal-dual) algorithms. 

7. Our term spectral bound means “spectral” auxiliary function in the 
context of bound optimization, not to be confused with bounds on eigenvalues. 
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Fig. 4: Iteration t of the general bound optimization procedure for 
function E{S) using auxiliary functions (bounds) at{S). Step I 
minimizes at{S). Step II computes the next bound at+i{S). 

The rest of the paper is organized as follows. Sections 2 
and 3 present our kernel and spectral bounds for (1) and detail 
combinatorial move making graph cut algorithms for its optimiza¬ 
tion. Section 4 discusses a number of extensions for our bound 
optimization approach. Sections 5 analyses kernel and bandwidth 
selection strategies. Section 6 presents many experiments where 
either pairwise clustering criteria benefit from the additional 
MRF constraints or common MRF formulations benefit from an 
additional clustering term for various high-dimensional features. 

2 Kernel Bounds 

First, we review the general bound optimization principle and 
present basic K-means as an example. Section 2.2 derives kernel 
bounds for standard pairwise clustering objectives. Without loss 
of generality, we assume symmetric affinities ^4 = ^4' since non- 
symmetric ones can be equivalently replaced by ^ , e.g. see 

(30) in Sec. 1.2.2. Positive definiteness of A is not assumed and 
diagonal shifts are discussed when needed. Move-making bound 
optimization for energy (1) is discussed in Section 2.3. 

2.1 Bound optimization and K-means 

In general, bound optimizers are iterative algorithms that optimize 
auxiliary functions (upper bounds) for a given energy E{S) 
assuming that these auxiliary functions are more tractable than the 
original difficult optimization problem [71], [72], [73], [37]. Let t 
be a current iteration index. Then at(S) is an auxiliary function 
of E(S) at current solution St if 

E{S) < at{S) yS (45a) 

E{St) = at(St). (45b) 

The auxiliary function is minimized at each iteration t (Fig. 4) 

St+i = argnnnat(5'). (46) 

This procedure iteratively decreases original function E(S) since 

< at(St^i) < at(St) = E{St). 

We show that standard KM procedures (23), (24) correspond 
to bound optimization for K-means objective (18). Note that 
variables rrik in mixed objective E{S^m) (16) can be seen as 


Fig. 5: K-means as linear bound optimization: As obvious from 
formula (49), the bound at(S) in Theorem 1 is a unary function 
of S. That is, KM procedures (23,24) correspond to optimization 
of linear auxiliary functions at(S) for KM objectives. Optimum 
St+i is finite since optimization is over Boolean e {0, 

relaxations of segment means (17) in single-variable KM 
objective E{S) (18) since 

= argmin Y, Up-nikf 
and E(S) = min E(S^m). (47) 

m 

Theorem 1 (bound for KM). Standard iterative K-means pro¬ 
cedures (23,24) can be seen as bound optimization methods for 
K-means objectives E{S) (18,22) using auxiliary function 

at(S) = E(S,pt) (48) 

at any current segmentation St = {St} with means pt = il^s^ }• 

Proof Equation (47) implies at{S) > E(S). Since at{St) = 
E(St) then at(S) is an auxiliary function for E(S). Re¬ 
segmentation step (23) gives optimal segments St+i minimizing 
the bound at(S). The re-centering step minimizing E(St+i^m) 
for fixed segments gives means pt+i defining bound at+i(S) for 
the next iteration. These re-segmentation (I) and re-centering (II) 
steps are also illustrated in Fig. 4 and Fig. 5. □ 

Theorem 1 could be generalized to probabilistic K-means [27] 
by stating that block-coordinate descent for distortion clustering 
or ML model fitting (14) is a bound optimization [37], [41]. 
Theorem 1 can also be extended to pairwise and weighted versions 
of KM. For example, one straightforward extension is to show that 
E^(S^pf) (32) with weighted means pf = {l^^k} (33) is a 
bound for weighted KM objective E^(S) (35), e.g. see Th. 6 in 
Appendix A. Then, some bound for pairwise wkKM energy (36) 
can also be derived (Cor. 1, Appendix A). It follows that bounds 
can be deduced for many pairwise clustering criteria using their 
reductions to various forms of /cKM reviewed in Sec. 1.2.2 or 1.3.1. 

Alternatively, the next Section 2.2 follows a more direct and in¬ 
tuitive approach to deriving pairwise clustering bounds motivated 
by the following simple observation. Note that function at(S) in 
Theorem 1 is unary with respect to S. Indeed, functions E(S^m) 
(16) or F^(5', m) (32) can be written in the form 

F{S,m) E Up-mkf s'; (49) 

k P 

F^(S,m) = YJl '^pW^p ~ (50) 

k p 

highlighting the sum of unary terms for variables Sp. Thus, 
bounds for KM or weighted KM objectives are modular (linear) 
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objective 

Ea(S) 

matrix formulation 

e(X)inEfce(5''=) 

concave relaxation 

e{X) (51) 

K, and w 
in Lemma 1 

kernel bound 

forEA(S') at5t 

AA (29) 

X'AX 

I'X 

x'{5i+A)x 

vx 

1C = 61 + A, w = 1 

Ffc Ve(S^)' + const 

for Ve in (52) 

AC (38) 

X'(D-A)X 

VX 

x'{5i+a-d)x 

vx 

K = 51 + A-D, w = l 

NC (41) 

X'AX 

d'X 

x'{5d+A)x 

d'X 

KL = SD + A, w = d 


TABLE 2: Kernel bounds for different pairwise clustering objectives Ea{S). The second column shows formulations of these objectives 
Ea{S) = e(S^) using functions e over segment indicator vectors e {0, The last column gives a unary (linear) upper 

bound for Ea(S) at St based on the first-order Taylor approximation of concave relaxation function e : IZ^ (51). 


function of S. This simple technical fact has several useful 
implications that were previously overlooked. For example, 

• in the context of bound optimization, KM can be integrated 
with many regularization potentials whose existing solvers 
can easily work with extra unary (linear) terms 

• in the context of real-valued relaxation of indicators 
linearity of upper bound at(S) (48) implies that the 
bounded function E(S) (22) is concave, see Fig. 5. 

In Section 2.2 we confirm that many standard pairwise clustering 
objectives in Sections 1.2.2 and 1.3.1 have concave relaxations. 
Thus, their linear upper bounds easily follow from the correspond¬ 
ing first-order Taylor expansions, see Figure 5 and Table 2. 



2.2 Kernel Bounds for AA, AC, and NC 


The next lemma is needed to state a bound for our joint energy (1) 
in Theorem 2 for clustering terms A A, AC, or NC in Table 2. 

Lemma 1 (concave relaxation). Consider function e : IZ^ IZ^ 
defined by matrix JC and vector w as 


e{X) 


X'JCX 
w'X ' 


(51) 


Function e{X) above is concave over region w'X > 0 assuming 
that (symmetric) matrix JC is positive semi-definite. (See Fig. 6) 


Fig. 6: Example: concave function e(X) = -^7^ X e [0,1]^. 
Note that convexity/concavity of similar rational functions with 
quadratic enumerator and linear denominator is known in other 
optimization areas, e.g. [74, p.72] states convexity of ^ for ^ > 0 

and [75, exercise 3.14] states convexity of ^ ^ 


implies linear bound Tt(X) for concave function e{X) at Xt 

Tt(X) = VeiXtYX. (53) 


Proof. Lemma 1 follows from the following expression for the 
Hessian of function e for symmetric JC 

VVe _ JC JCXw' + wX'JC wX'JCXw' 

2 ” w'X ^ (w'XV (w'XV 



which is negative semi-definite for w'X > 0 for p.s.d. JC. □ 
The first-order Taylor expansion at current solution Xt 
Tt{X) e{Xt) + Ve{XtY{X-Xt) 
is a bound for the concave function e(X) (51). Its gradient^ 

Ve(Xt) = w - ICXt^ (52) 

(w'XtE w'Xt 

8. Function e and gradient Ve are defined only at non-zero indicators Xt 
where w'Xt > 0. We can formally extend e to A = 0 and make the bound 
Tt work for e at At = 0 with some supergradient. However, At = 0 is not a 
problem in practice since it corresponds to an empty segment. 


As shown in the second column of Table 2, common pairwise 
clustering objectives defined by affinity matrix A such as A A (29), 
AC (38), and NC (41) have the form 

Ea{S) = 7e{S7 

k 


with function e(X) as in (51) from Lemma 1. However, arbitrary 
affinity A may not correspond to a positive semi-definite JC in 
(51) and e(X) may not be concave for X e 7^1^L However, the 
diagonal shift trick [38] in (30) works here too. The third column 
in Table 2 shows concave function e(X) that equals e{X) for any 
non-zero Boolean X g {0, up to a constant. Indeed, for AA 


e(X) 


X'{51 + A)X 

vx 


X'AX 

VX 


-5 = 


e{X) 


since X'X = VX for Boolean X. Clearly, (51 + A is p.s.d. for 
sufficiently large 5 and Lemma 1 implies that the first-order Taylor 
expansion Tt(X) (53) is a linear bound for concave function 
e(X). Equivalence between e and e over Booleans allows to 
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Algorithm 1: ^-Expansion for Kernel Cut 

Input : Affinity Matrix A of size |ri| x |ri|; initial labeling Sq, 
Output: : partition of the set Q 

Find p.s.d. matrix /C as in Table 2. Set t := 0; 
while not converged do 

Set at(S) to be kernel bound (55) at current partition St ; 
for each label a € C = {1, K} do 
I Find St •= arg min at(S) within one a expansion of St ; 

end 

Set f := f + 1; 

end 


use Tt{X) as a bound for e when optimizing over indicators X. 
Function e : ^ can be described as a concave relaxation 

of the high-order pseudo-boolean function e : {0,1}'^' . 

Concave relaxation e for AC in Table 2 follows from the same 
diagonal shift (51 as above. But NC requires diagonal shift 5D 
with degree matrix D = diag(d) as in (43). Indeed, 


e{X) 


X'{5D + A)X 
dfX 


X'AX 

d'X 


-5 = e{X) 


(54) 


since X'DX = X'diag{d)X = d'X for any Boolean X. Clearly, 
SD + ^ is p.s.d. for sufficiently large S assuming dp > 0 for all 
p e Cl. Concave relaxations and the corresponding Taylor-based 
bounds for EAiS) in Table 2 imply the following theorem. 


Theorem 2 (kernel bound). For any (symmetric) affinity matrix 
A and any current solution St the following is an auxiliary 
function for energy (1) with any clustering term Ea (S)from Tab.2 

at{S) = + 7 E ^c(5c) (55) 

k c^T 

where e and Ve are defined in (51), (52) and S is large enough so 
that the corresponding JC in Table 2 is positive semi-definite. 


2.3 Move-making algorithms 

Combination (55) of regularization potentials with a unary/linear 
bound for high-order term Ea(S) can be op¬ 

timized with many standard discrete or continuous multi-label 
methods including graph cuts [10], [76], message passing [77], LP 
relaxations [17], or well-known continuous convex formulations 
[19], [20], [21]. We focus on MRF regularizers (see Sec. 1.1) 
commonly addressed by graph cuts [10]. We discuss some details 
of kernel bound optimization technique using such methods. 

Step I of the bound optimization algorithm (Fig.4) using auxil¬ 
iary function at{S) (55) for energy E(S) (1) with regularization 
potentials reviewed in Sec. 1.1 can be done via move-making 
methods [10], [ ], [12]. Step II requires re-evaluation of the first 

term in (55), i.e. the kernel bound for Ea- Estimation of gradients 
We{St) in (52) has complexity 0{K\ft\^). 

Even though the global optimum of at at step I (Fig.4) is not 
guaranteed for general potentials Ec, it suffices to decrease the 
bound in order to decrease the energy, i.e. (45a) and (45b) imply 

at{St^i) < at{St) => E{St^i) < E{St)- 



Compare 

against 

# of wins 

p-value^ 

a- 

expan- 

sion 

(3^/5-swap 

135/200 

10"® 

a- 

expan- 

sion 

(T-expan¬ 
sion"^ 

182/200* 

10-34t 


^ The probability to exceed the given number of wins by random chance. 

* The algorithm stopped due to time limit (may cause incorrect number of wins). 

(b) BSDS500 training dataset 

Fig. 7: Typical energy evolution wrt different moves and frequency 
of bound updates, a-expansion updates the bound after a round of 
expansions, a-expansion^ updates the bound after each expansion 
move. Initialization is a regular 5x5 grid of patches. 


evaluation. In the case of (3(-expansion, there are at least three 
options: updating the bound after a single expansion step, or after 
a single expansion loop, or after the convergence of (T-expansion. 
More frequent bound recalculation slows down the algorithm, but 
makes the bound tighter. The particular choice generally depends 
on the trade-off between the speed and solution quality. However, 
in our experiments more frequent update does not always improve 
the energy, see Fig. 7. We recommend updating the bound after a 
single loop of expansions, see Alg.l. We also evaluated a swap 
move version of our kernel cut method with bound re-estimation 
after a complete (T(/5-swaps loop, see Fig. 7. 

3 Data Embeddings and Spectral Bounds 

This section shows a different bound optimization approach to 
pairwise clustering and joint regularization energy (1). In contrast 
to the bounds explicitly using affinity A or kernel matrices /C in 
Sec. 2.2, the new approach is based on explicit use of isometric 
data embeddings (j). While the general Mercer theorem guarantees 
existence of such possibly infinite dimensional Hilbert space 
embedding, we show finite dimensional Euclidean embedding 

(j) {(j)p] where e 0} c 


For example. Algorithm 1 shows a version of our kernel cut 
algorithm using (T(-expansion [10] for decreasing bound at{S) in 
(55). Other moves are also possible, for example a/d-swap. 

In general, tighter bounds work better. Thus, we do not run 
iterative move-making algorithms for bound at until convergence 
before re-estimating at+i. Instead, one can reestimate the bound 
either after each move or after a certain number of moves. One 
should decide the order of iterative move making and bound 


with exact isometry (19,20) to kernels JC in Table 2 and lower 
dimensional embeddings 

:= [fip] where {fip\p e 0} c TZ^ for m < |0| 

that can approximate the same isometry with any accuracy level. 
The embeddings use eigen decompositions of the kernels. 

Explicit embeddings allow to formulate exact or approximate 
spectral bounds for standard pairwise clustering objectives like 
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AA, AC, NC. This approach is very closely related to spectral 
relaxation, see Sec. 3.3. For example, optimization of our ap¬ 
proximate spectral bounds for m = K is similar to standard 
discretization heuristics using K-means over eigenvectors [8]. 
Thus, our bound optimization framework provides justification for 
such heuristics. More importantly, our spectral bounds allow to 
optimize joint energy (1) combing pairwise clustering objectives 
with common regularization terms. 

Spectral bound is a useful alternative to kernel bounds in 
Sec. 2.2 with different complexity and other numerical properties. 
In particular, spectral bound optimization using lower dimensional 
Euclidean embeddings 0 for m « |r^| is often less sensitive to 
local minima. This may lead to better solutions, even though such 
(j) only approximately isometric to given pairwise affinities. For 
m = |0| the spectral bound is mathematically equivalent to the 
kernel bound, but numerical representation is different. 

3.1 Exact and approximate embeddings (j) for kKM 

This section uses some standard methodology [78] to build the 
finite-dimensional embedding <pp = (l){Ip) with exact or approxi¬ 
mate isometry (19,20) to any given positive definite kernel k over 
finite data set {Ip\p e f]}. As discussed in Sec. 1.2.2, /cKM and 
other pairwise clustering methods are typically defined by affini¬ 
ties/kernels k and energy (22) rather than by high-dimensional 
embeddings (j) with basic KM formulation (18). Nevertheless, data 
embeddings (pp could be useful and some clustering techniques 
explicitly construct them [8], [79], [38], [48], [52], [49]. In 
particular, if dimensionality of the embedding space is relatively 
low then the basic iterative KM procedure (23) minimizing (18) 
could be more efficient than its kernel variant (24) for quadratic 
formulation (22). Even when working with a given kernel k it may 
be algorithmically beneficial to build the corresponding isometric 
embedding (j). Below we discuss finite-dimensional Euclidean 
embeddings in IZ^ {m < |(7|) allowing to approximate standard 
pairwise clustering via basic KM. 

Eirst, we show an exact Euclidean embedding isometric to a 
given kernel. Any finite data set {Ip\p e and any given kernel 
k define a positive definite kernel matrix^ 

K^pq = k(^ Ip , Iq ) 

of size \Cl\ X \Q\. The eigen decomposition of this matrix 
JC = V'AV 

involves diagonal matrix A with non-negative eigenvalues and 
orthogonal matrix V whose rows are eigenvectors, see Eig.8(a). 
Non-negativity of the eigenvalues is important for obtaining 
decomposition A = \/A • \/A allowing to define the following 
Euclidean space embedding 

4>p := \/~AVp e 7^'^' (56) 

where Vp are column of V, see Eig.8(a). This embedding satisfies 
isometry (19,20) since 

= (VAYpYiVAVg) = fCpg = k(Ip,Ig). 

Note that (56) defines a simple finite dimensional embedding 
(pp = 0(/p) only for subset of points {Ip\p e Q} in based 

9. If is given as a continuous kernel k(x,y) : x 71^ IZ matrix 1C 

can be seen as its restriction to finite data set {/p|p e c . 


large eigen values and 
corresponding vectors/dimensions 


small eigen values and 
corres-Donding vectors/di n : - 


(a) decomposition K = V'AV 



Fp G 521^1 


(b) decomposition K = {V^)'AV^ for m < |f]| 


|n|x |n| 

rank m 


K = 


|fi| xm 




A" 


mx\n\ 
Qten vector 
g en vector 
^en vector 


ym 


G 


Eig. 8: Eigen decompositions for kernel matrix /C (a) and its rank 
m approximation JC (b) minimizing Erobenius errors (57) [78]. 
Decompositions (a,b) allow to build explicit embeddings (56,59) 
isometric to the kernels, as implied by the Mercer theorem. 


on a discrete kernel, i.e. matrix JCpq. In contrast, Mercer’s theo¬ 
rem should produce a more general infinite dimensional Hilbert 
embedding 0(x) for any x e IZ^ by extending the eigen decom¬ 
position to continuous kernels k{x^y).lr\ either case, however, the 
embedding space dimensionality is much higher than the original 
data space. Eor example, <pp in (56) has dimension |D|, which is 
much larger than the dimension of data Ip, e.g. 3 for RGB colors. 

Embedding (56) satisfying isometry (19,20) is not unique. Eor 
example, any decomposition JC = G'G, e.g. Cholesky [80], defines 
a mapping (pp := Gp with desired properties. Also, rotational 
matrices R generate a class of isometric embeddings (pp '.= R(j)p. 

It is easy to build lower dimensional embeddings by weaken¬ 
ing the exact isometry requirements (19,20) following the standard 
multi-dimensional scaling (MDS) methodology [78], as detailed 
below. Consider a given rank m <\Q\ approximation JC for kernel 
matrix JC minimizing Erobenius norm errors [78] 

I|/C-V||f := (57) 

pq&fl 

It is well known [78], [80] that the minimum Erobenius error is 
achieved by 

jc = (v^Ya^v^ 

where V^ is a submatrix of V including m rows corresponding 

to the largest m eignenvalues of JC and is the diagonal matrix 

of these eigenvalues, see Eig. 8(b). The corresponding minimum 
Erobenius error is given by the norm of zeroed out eigenvalues 

+ + (58) 
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Fig. 9: Low-dimensional Euclidean embeddings (59) for m = 2 
and m = 3 in (c,d) are approximately isometric to a given affinity 
matrix (b) over the data points in (a). The approximation error 

(58) decreases for larger m. While generated by standard MDS 
methodology [78], it is intuitive to call embeddings (j) in (56) and 

(59) as (exact or approximate) isometry eigenmap or eigen isomap. 


It is easy to check that lower dimensional embedding 

:= 6 TZ^ (59) 

is isometric with respect to approximating kernel JC, that is 

= JCpq ^ JCpq. (60) 

Fig. 9 shows examples of low-dimensional approximate isometry 
embeddings (59) for a Gaussian kernel. Note that ^p e (59) 
can be obtained from (pp e (56) by selecting coordinates 
corresponding to dimensions of the largest m eigenvalues. 

According to (58) lower dimensional embedding (pp in (59) 
is nearly-isometric to kernel matrix JC if the ignored dimensions 
have sufficiently small eigenvalues. Then (59) may allow efficient 
approximation of kernel K-means. For example, if sufficiently 
many eigenvalues are close to zero then a small rank m approxi¬ 
mation JC will be sufficiently accurate. In this case, we can use a 
basic iterative K-means procedure directly in with 0(\Cl\m) 
complexity of each iteration. In contrast, each iteration of the 
standard kernel K-means (22) is 0(|(7|^) in general^^. 

There is a different way to justify approximate low¬ 
dimensional embedding (pp ignoring small eigenvalue dimensions 
in (pp. The objective in (22) for exact kernel JC is equivalent to the 
basic K-means (16) over points pp (56). The latter can be shown 
to be equivalent to (probabilistic) K-means (13) over columns Vp 
in orthonormal matrix V using weighted distortion measure 
| 0 | 

11^ “/^IIa XI “ jA^V) - \\4^p ~ 

i=l 

where index [i] specifies coordinates of the column vectors. Thus, 
a good approximation is achieved when ignoring coordinates for 

10. Without KNN or other special kernel accelerations. 


small enough eigenvalues contributing low weight in the distortion 
above. This is equivalent to K-means (16) over points (59). 


3.2 Spectral Bounds for AA, AC, and NC 

The last Section showed that /cKM clustering with given p.s.d. 
kernel JC can be approximated by basic KM over low-dimensional 
Euclidean embedding p g 7Z^ (59) with approximate isometry to 
JC (60). Below we use equivalence of standard pairwise clustering 
criteria to kKM, as discussed in Sections 1.2.2 and 1.3.1, to derive 
the corresponding low-dimensional embeddings for AA, AC, NC. 
Then, equivalence of KM to bound optimization (Theorem 1) 
allows to formulate our approximate spectral bounds for the 
pairwise clustering and joint energy (1). The results of this Section 
are summarized in Table 3. For simplicity, assume symmetric 
affinity matrix A. If not, equivalently replace A by ^ . 

Average association (AA): Diagonal shift /C = (51 + A in (30) 
converts A A (29) with A to equivalent kKM (22) with p.d. kernel 
JC. We seek rank-m approximation JC minimizing Frobenius error 
||/C - Provided eigen decomposition A = V'KV, equation 
(59) gives low-dimensional embedding (also in Tab. 3) 


= VJI™ -I- 


(61) 


corresponding to optimal approximation kernel JC. It follows that 
KM (23) over this embedding approximates AA objective (22). 
Note that the eigenvectors (rows of matrix V, Fig. 8) also solve the 
spectral relaxation for AA in Tab. 4. However, ad hoc discretization 
by KM over points VjA may differ from the result for points (61). 

Average cut (AC): As follows from objective (38) and diago¬ 
nal shift (30) [38], average cut clustering for affinity A is equiv¬ 
alent to minimizing kKM objective with kernel JC = 51 + A - D 
where 19 is a diagonal matrix of node degrees dp = Y^qApq. 
Diagonal shift is needed to guarantee positive definiteness of 
the kernel. Eigen decomposition for 19 - A = V'KV implies 
JC = V\51 - A)V. Then, (59) implies rank-m approximate 
isometry embedding (also in Tab. 3) 


pp = A51^ - (62) 


using the same eigenvectors (rows of y) that solve AC’s spectral 
relaxation in Tab. 4. However, standard discretization heuristic 
using KM over pp = may differ from the results for our ap¬ 
proximate isometry embedding pp (62) due to different weighting. 

Normalized cut (NC): According to [40] and a simple deriva¬ 
tion in Sec. 1.3.1 normalized cut for affinity A is equivalent to 
weighted kKM with kernel JC = SD~^ +19"^ AI9“^ (44) and node 
weights Wp = dp based on their degree. Weighted kKM (36) can 
be interpreted as KM in the embedding space with weights Wp 
for each point pp as in (32,33). The only issue is computing m- 
dimensional embeddings approximately isometric to JC. Note that 
previously discussed solution p in (59) uses eigen decomposition 
of matrix JC to minimize the sum of quadratic errors between 
Kpq and approximating kernel JCpq = {pp^pq). This solution may 
still be acceptable, but in the context of weighted points it seems 
natural to minimize an alternative approximation measure taking 
Wp into account. For example, we can find rank-m approximate 
affinity matrix JC minimizing the sum of weighted squared errors 

^ WpWgilCpg - iCpgf = ||D5 (/C - iC)D^\F. (63) 

pq&Q 

Substituting JC = 5D~^ + D~^AD~^ gives an equivalent objective 

\\D~^{5D + A)D~^ - D^JCD^If- 
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objective 

Ea(S) 

matrix formulation 

e(X)in Efce(5'=) 

equivalent A) KM (22,36) 
as in [38], [58] 

eigen decomposition 

V'AV = ... 

embedding in m < |f2| 

with approx, isometry (60) 

spectral bound 

for Ea{S) at St 

AA (29) 

X'AX 

VX 

K:= 61 +a 

A 

(61) 

F(S,tit) (49) 
for points ij^p 

AC (38) 

X'{D-A)X 

vx 

K:= 61 +a-D 

D-A 

4>p = (62) 

NC (41) 

X'AX 

d'X 

K.= 5D-^ +D-^AD-^ 

weighted, Wp = dp 

D~2AD~2 


(50) 

for points 4>p 


TABLE 3: Spectral bounds for objectives Ea{S). The third column shows p.d. kernel matrices JC for the equivalent /cKM energy (22). 
Eigen decomposition for matrices in the forth column defines our Euclidean embedding (j)p g (fifth column) isometric to /C (60). 
Thus, K-means over <pp approximates /cKM (22). Bounds for KM (last column) follow from Th. 1 & 6 where /it = are means 

(17) and = {p^k} are weighted means (33). Eunctions F(S,m) and F^(S^m) are modular (linear) w.r.t. S, see (49,50). 



spectral relaxation [8] 

common discretization heuristic [81] 

(embedding & K-means) 

AA 

Au = Au 

75 

II 

III 


V'KV = A 

AC 

(D - A)u = Au 

75 

II 

A 



V'AV = D- A 

NC 

(D - A)u = XDu 

4 := 



V'AV = D~^AD-^ 


TABLE 4: Spectral relaxation and discretization heuristics for objectives for pairwise clustering objectives Ea{S) for affinity A. The 
corresponding degree matrix D is diagonal with elements dp Eg To extract integer labeling from the relaxed solutions produced 
by the eigen systems (second column), spectral methods often apply basic KM to some ad hoc data embedding (j) (last column) based 
on the first K unit eigenvectors u, the rows of matrix . While our main text discusses some variants, the most basic idea [8], 
[81] is to use the columns of as embedding ^p. Eor easier comparison, the last column also shows equivalent representations of 
this embedding based on the same eigen decompositions V'KV as those used for our isometry eigenmaps in Tab. 3. In contrast, our 
embeddings are derived from justified approximations of the original non-relaxed AA, AC, or NC objectives. Note that NC corresponds 
to a weighted case of K-means with data point weights Wp = dp [52], [40], see (42) in Section 1.3.1. 


Consider rank-m matrix M := as a new minimization 

variable. Its optimal value + A^)^"^ follows from 

17“ 2 (^SD + A)D~^ = 1/'((5I + A)y for eigen decomposition 

D~^AD~^=V'KV. (64) 

Thus, optimal rank-m approximation kernel K is 

JC = D-^ (65) 

It is easy to check that m-dimensional embedding (also in Tab. 3) 


(])p 




( 66 ) 


is isometric to kernel JC, that is {(j)py(j)q) = JCpq. Therefore, 
weighted KM (32) over low-dimensional embedding <pp (66) with 
weights Wp = dp approximates NC objective (41). 

Summary: The ideas above can be summarized as follows. 
Assume AA, AC, or NC objectives Ea{S) with (symmetric) A. 
The third column in Table 3 shows kernels JC for equivalent kKM 
objectives F(S) (22,36). Eollowing eigenmap approach (Eig.8), 
we find rank-m approximate kernel JC ^ 1C minimizing Erobenius 
error \JC - 1C\f (57) or its weighted version (63) and deduce 
embeddings (pp g IZ^ (61), (62), (66) satisfying isometry 


t>p4>q - JCpq i 


JC 


pq- 


Basic K-means objective F(S^m) (16,32) for {(pp} is equivalent 
to kKM energy F{S) (22,36) for kernel JC ^ JC and, therefore, 
approximates the original pairwise clustering objective 


F{S,ps) - F{S) ^ F{S) ^ Ea{S). 


Theorem 1 gives unary (linear) bound F(S^pt) (49,50) for objec¬ 
tive F(S) (16,32). We refer to F(S^pt) as a spectral auxiliary 
function for approximate optimization of Ea{S) (last column in 
Table 3). We will also simply call F(S,pt) a spectral bound, not 
to be confused with a similar term used for matrix eigenvalues. 

Similarly to kernel bound in Section 2.2, spectral bound is 
useful for optimizing (1). In fact, for m = \Cl\ the spectral bounds 
(Tab. 3) are algebraically equivalent to the kernel bounds (Tab. 2) 
since JC = JC, see (58). Eor m < |0| we have JC ^ JC and F 
F. Therefore, we can iteratively minimize energy E(S) (1) by 
applying bound optimization to its spectral approximation 

E{S) = F{S)+ 7 Z^ciSc) (67) 

or its weighted spectral approximation 

E(S) = F^{S)+ 7 Z^ciSc). (68) 

Theorem 3 (spectral bound). For any (symmetric) affinity matrix 
A assume sufficiently large diagonal shift S generating p.s.d. 
kernel JC as in Table 3. Then, auxiliary function 

atiS) = F(S,nt) + 7 (69) 

using F(S^m) (49,50) with embedding {fp} c: FF' in Tab. 3 is a 
bound for joint energy (67,68) approximating (1) as m ^ |(7|. 

Approximation quality (58) depends on omitted eigenvalues 
for i > m. Representative examples in Eig.lO show that relatively 
few eigenvalues may dominate the others. Thus, practically good 



























15 


Algorithm 2: ^-Expansion for Spectral Cut 

Input : Affinity Matrix A of size |ri| x |ri|; initial labeling Sq, 

Output: : partition of the set Q 

Find top m eigenvalues/vectors for a matrix in the 4*^ col. of Tab. 3 ; 

Compute embedding {0p} c for some <5 and set t := 0; 
while not converged do 

Set dt{S) to be the spectral bound (69) at current partition St ; 
for each label a € C = {1, K} do 
I Find St •= arg min dt (S') within one a expansion of St ; 

end 

Set t := t + 1; 

end 


approximation with small m is possible. Larger m are compu¬ 
tationally expensive since more eigenvalues/vectors are needed. 
Interestingly, smaller m may give better optimization since K- 
means in higher-dimensional spaces may be more sensitive to 
local minima. Thus, spectral bound optimization for smaller m 
may find solutions with lower energy, see Fig. 11, even though the 
quality of approximation is better for larger m. 

Similarly to the kernel bound algorithms discussed in Sec¬ 
tion 2.3 one can optimize the approximate spectral bound (69) 
for energy (1) using standard algorithms for regularization. This 
follows from the fact that the first term in (69) is unary (linear). 
Algorithms 2 shows a representative (approximate) bound opti¬ 
mization technique for ( 1 ) using move-making algorithms [82]. 
Note that for 7 = 0 (no regularization terms) our bound opti¬ 
mization Algorithm 2 reduces to basic K-means over approximate 
isometry embeddings {(pp} similar but not identical to 

common discretization heuristics in spectral relaxation methods. 

3.3 Relation to spectral clustering 

Our approximation of pairwise clustering such as NC via basic 
KM over low dimensional embeddings (f)p is closely related 
to popular spectral clustering algorithms [ 8 ], [79], [48] using 
eigen decomposition for various combinations of kernel, affinity, 
distortion, laplacian, or other matrices. Other methods also build 
low-dimensional Euclidean embeddings [79], [48], [49] for basic 
KM using motivation different from isometry and approximation 
errors with respect to given affinities. We are mainly interested in 
discussing relations to spectral methods approximately optimizing 
pairwise clustering criteria such as AA, AC, and NC [ 8 ]. 

Many spectral relaxation methods also use various eigen 
decompositions to build explicit data embeddings followed by 
basic K-means. In particular, the smallest or largest eigenvectors 
for the (generalized) eigenvalue problems in Table 4 give well- 
known exact solutions for the relaxed problems. In contrast to our 
approach, however, the final K-means stage in spectral methods is 
often presented without justification [ 8 ], [81], [67] as a heuristic 
for quantizing the relaxed continuous solutions into a discrete 
labeling. It is commonly understood that 

“... there is nothing principled about using the K-means 
algorithm in this step” (Sec. 8.4 in [81]) 
or that 

“... K-means introduces additional unwarranted as¬ 
sumptions” (Sec. 4 in [59]) 

Also, typical spectral methods use K eigenvectors solving the 
relaxed iT-cluster problems followed by KM quantization. In con¬ 
trast, we choose the number of eigenvectors m based on Frobenius 
error for isometry approximation (58). Thus, the number m is 
independent from the predefined number of clusters. 


Below we juxtapose our approximate isometry low¬ 
dimensional embeddings in Table 3 with embeddings used for ad- 
hoc discretization by the standard spectral relaxation methods in 
Table 4. While such embeddings are similar, they are not identical. 
Thus, our Frobenius error argument offers a justification and minor 
corrections for KM heuristics in spectral methods, even though the 
corresponding methodologies are unrelated. More importantly, our 
bound formulation allows integration of pairwise clustering with 
additional regularization constraints ( 1 ). 

Embeddings in spectral methods for NC: Despite similarity, 
there are differences between our low-dimensional embedding 
( 66 ) provably approximating kernel JC = 5D~^ + D~^AD~^ 
for the kKM formulation of NC [52], [40] and common ad- 
hoc embeddings used for KM discretization step in the spectral 
relaxation methods. For example, one such discretization heuristic 
[ 8 ], [81] uses embedding (f)p (right column in Tab. 4) defined by 
the columns of matrix whose rows are the K top (unit) eigen¬ 
vectors of the standard eigen system (left column). It is easy to 
verify that the rows of matrix VD~^ are non-unit eigenvectors for 
the generalized eigen system for NC. The following relationship 

7 = 

where operator normalizes matrix rows, demonstrates certain 
differences between ad hoc embeddings used by many spectral 
relaxation methods in their heuristic K-means discretization step 
and justified approximation embedding ( 66 ) in Tab. 3. Note that 
our formulation scales each embedding dimension, i.e. rows in 
matrix D~^, according to eigenvalues instead of normalizing 
these rows to unit length. 

There are other common variants of embeddings for the K- 
means discretization step in spectral relaxation approaches to the 
normalized cut. For example, [83], [ 66 ], [67] use 

7 = [A-^C/]^ 

for discretization of the relaxed NC solution. The motivation 
comes from the physics-based mass-spring system interpretation 
[83] of the generalized eigenvalue system. 

Some spectral relaxation methods motivate their discretization 
procedure differently. For example, [59], [52] find the closest 
integer solution to a subspace of equivalent solutions for their 
particular very similar relaxations of NC based on the same eigen 
decomposition (64) that we used above. Yu and Shi [59] represent 
the subspace via matrix 

X' E 

where columns differ from our embedding (j){Ip) in ( 66 ) only by 
normalization. Theorem 1 by Bach and Jordan [52] equivalently 
reformulates the distance between the subspace and integer label¬ 
ings via a weighted K-means objective for embedding 



and weights Wp = dp. This embedding is different from ( 66 ) only 
by eigenvalue scaling. 

Interestingly, a footnote in [52] states that NC objective (41) 
is equivalent to weighted KM objective (32) for exact isometry 
embedding 

<pp = ^Gp (71) 

dp 
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(c) mPb kernel [67] for image 




xIO^ 


(d) KNN kernel for image 


Fig. 10: Spectrum of eigenvalues of typical kernel matrices for 
synthetic data (top row) or real image color (bottom row). This 
helps us to select approximate embedding so as to have small 
approximation error (58). For example, with fixed width gaussian 
kernel in (a), it suffices to select a few top eigenvectors since 
the remaining eigenvalues are negligible. Note that the spectrum 
elevates with increasing diagonal shift S in (61). In principle, we 
can find the optimal shift for a given number of dimensions m to 
minimize approximation error. 


- weighted K-means plus constl 

- Normalied cuts Energy 



3 dim. embedding 




Iteration 


Fig. 11: For data and affinity matrix in Fig. 9, we run weighted 
K-means with our approximate embedding. The approximation 
errors ||/C - /C|||r/||/C|||r for 3, 6, 10 and 50 dim. embedding are 
58%, 41%, 27% and 3% respectively. We compute weighted K- 
means energy (up to a const) and normalized cuts energy for 
solution obtained at each iteration. We observed that normalized 
cuts energy indeed tends to decrease during iterations of K- 
means. Even 10 dim. embedding gives good alignment between 
K-means energy and normalized cuts energy. Higher dimensional 
embedding gives better energy approximation, but not necessarily 
better solution with lower energy. 


based on any decomposition A = G'G. For example, our exact 
isometry map (66) for m = \ft\ and G = \/AVD^ is a special 
case. While [52] reduce NC to K-means^ \ their low-dimensional 
embedding 0 (70) is derived to approximate the subspace of 
relaxed NC solutions. In contrast, low-dimensional embedding 
(66) approximates the exact esometry map (j) ignoring relaxed 
solutions. It is not obvious if decomposition A = G'G for the 
exact embedding (71) can be used to find any approximate lower¬ 
dimensional embeddings like (66). 

4 Other optimization ideas 

This Section outlines several extensions for optimization methods 
presented in Sec.2 and 3. For example, we vary diagonal shifts 6 to 
reduce Frobenius approximation error ||^ - /C||i7 (58) between the 
optimal rank-m matrix JC and kernels JC = A + 6I, see Sec.4.1. We 
also discuss pseudo-bound [37] and trust region [84] as alternative 
approximate optimization concepts that may improve our bound 
optimization framework for high-order energy (1), see Sec.4.2. 

4.1 Extensions for spectral bound optimization 

We derive low-dimensional embeddings (61),(62), (66) in a princi¬ 
pled way allowing to estimate approximation errors with respect to 
the exact pairwise clustering criteria. Frobenius errors (58) depend 
on embedding dimensionality m and diagonal shift S that can 
increase or decrease all eigenvalues A^, see Fig. 10. Unlike typical 
discretization heuristics in spectral relaxation methods, our choice 
of m is not restricted to the fixed number of clusters K. Moreover, 

11. KM procedure (23) (weighted version) is not practical for objective (32) 
for points 4>p in Instead, Dhillon et al. [40] later suggested pairwise KM 
procedure (24) (weighted version) using kernel Kpq = {(j)p^(f)q). 


for each fixed m we can find an optimal shift 6 minimizing the 
sum of squared norms of omitted eigenvalues as follows. 

For simplicity we focus on AA with symmetric affinity 
A = V'AV even though our technique can be easily extended to 
AC and NC. Diagonal shift 6 ensures positive semi-definiteness 
of the kernel matrix JC(S) = 61 + A for an equivalent kKM. 
For approximate low dimensional embedding (61) it is enough 
to guarantee positive semi-definiteness of rank-m approximating 
kernel }C(6) 

iC{6) = {V^)'{61^ + . 

Thus one should use 6 such that all eigenvalues \i for i <m are 
non-negative. Given this restriction one can choose 6 to minimize 
Frobenius approximation error (58): 

mm\\JC(S) - s.t. (5>-minAi. (72) 

^ i<m 

Making use of (58), the objective in (72) is 
giving optimum 

^ ^ tr(V(0)-V(0)) 

\ft\-m \ft\-m 

automatically satisfying the constraint in (72). Indeed, -6"^ is 
the mean value of the discarded eigenvalues A^, so the shifted 
discarded eigenvalues (A^ + 6"^) must contain both positive and 
negative values. Since and use the largest eigenvalues 

{A^|i < m} we have 

mm{Xi + 6"^) > max(Ai + (5"^) > 0 

i<m i>m 

and the constraint in (72) is satisfied. In practice the effect of 
the diagonal shift mainly depends on the whole spectrum of 
eigenvalues for a particular affinity matrix, see Figure 10. 
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4.2 Pseudo-bound and trust region methods 

Our bound optimization approach to energy (1) can be extended in 
several ways. For example, following pseudo-bound optimization 
ideas [37], parametric methods [56] can be used to optimize our 
bounds with additional perturbation terms reducing sensitivity to 
local minima. It also makes sense to use our (spectral) bounds 
as an approximating functional for (1) in the context of trust 
region approach to optimization [84]. We also note that parametric 
methods can explore all diagonal shifts S alleviating the need for 
expensive eigen decomposition of large matrices. 

Pseudo-bound optimization: Consider the following defini¬ 
tion introduced in [37]. 


Definition 1. (pseudo-bound) Given energy E(S) and param¬ 
eter T] e IZ, functional Bt{S^r]) is a pseudo-bound for energy 
E{S) at current solution St if there exists at least one p' such 
that Bt(S,p') is an auxiliary function of E(S) at St. 


Instead of using an auxiliary function, one can optimize a 
family of pseudo-bounds that includes at least one proper bound. 
This guarantees that original functional E(S) decreases when 
the best solution is selected among the global minima for the 
whole family [37]. In the meanwhile, such pseudo-bounds may 
approximate E(S) better than a single auxiliary function, even 
though they come from the same class of sub-modular (globally 
optimizable) functionals. The experiments of [37] confirmed that 
pseudo-bounds significantly outperform the optimization quality 
obtained by a single auxiliary function in the context of several 
high-order segmentation functionals, e.g., entropy [85], Bhat- 
tacharyya measure [86] and KL divergence [73]^^. If the pseudo¬ 
bounds are monotone w.r.t. parameter p, we can find all global 
minima for the whole family in polynomial time via parametric 
max-flow algorithm [56]. This practical consideration is important 
when building a pseudo-bound family. For instance, [37] built 
pseudo-bounds by simply adding monotone unary terms p\S^\ to 
the auxiliary functions. We can use a similar perturbation term in 
the context of auxiliary functions in Table 2 and 3. 

As a proof-of-the-concept, we included Figure 12 demonstrat¬ 
ing one synthetic binary clustering example. It uses standard NC 
objective (41) with p.d. Gaussian kernel (requiring no diagonal 
shift) and no additional regularization terms. Basic kernel bound 
optimization converges to a weak local minimum, while pseudo¬ 
bound optimization over all parameters p in perturbation term 
7715'^ I achieves a much better solution with lower energy. This 
toy example suggests that pseudo-bound approach may compete 
with standard spectral relaxation methods [8]. 

A different perturbation term is motivated by using diagonal 
shift as a perturbation parameter p = 6. Besides improving 
optimization, efficient search over S may make it unnecessary to 
compute expensive eigen decompositions when estimating diag¬ 
onal shift for proper p.s.d. kemels/affinities in the third columns 
of Table 2 and 3. Consider function e(X) defined by symmetric 
affinities A and weights w 


e(X) 


X'AX 

w'X 


which is similar to e{X) in (51) except that A is not necessarily 
positive definite. Thus, e{X) may not be concave as a function 
of relaxed continuous argument X, see Lemma 1. Pseudo-bound 


for clustering objectives in the general form Ea{S) = 

can be derived from diagonal shift SW + A for W := diag{w) 

resulting in equivalent objectives. 

Theorem 4 (pseudo-bound). Consider pairwise clustering ob¬ 
jectives inform Ea{S) '.= The following is a pseudo- 

bound of Ea(S) at St 

i ^ (73) 

k k ^t 

becoming a proper auxiliary function for 5 > -Ao(FF ^ AW ^) 
where Aq denotes the smallest eigenvalue of the matrix. The 
corresponding pseudo-bound for the joint energy combining high- 
order clustering term Ea with regularization terms in (1) is 

Bt{S,5) + ^ Y.Ec{Sc). (74) 

c^T 

Proof Implied by Lemma 2 (Appendix D) after omitting 5K, 
which is a constant w.r.t. S not affecting optimization. □ 


Theorem 4 provides pseudo-bound (74) for joint energy (1) 
combining A A, AC, NC objectives in Table 2 with regularization 
potentials. When the number of segments is iT = 2 parametric 
max-flow methods [56] could be used to explore all distinct opti¬ 
mal solutions S{5) for pseudo-bound (74) for all 5. In particular, 
this search covers S where (74) is a proper bound removing the 
need for explicitly evaluating Xo(D^ AD^) via expensive eigen 
decomposition. However, in contrast to the common perturbation 
term 77|5'^| [37], it is easy to check that the second term in 
Bt(S^ S) (73) is not monotonic [56] with respect to 5. Thus, there 
is no guarantee that optimal object segments S^ (5) form a nested 
sequence w.r.t. paprameter S and that the number of such distinct 
discrete solutions is bounded by 1 + |17|. 

Note that a parametric search could also be implemented for 
K > 2 in the context of move making algorithms, see Section 2.3, 
using parametric max-flow [56] to explore all S at each expansion 
or other move [10]. If the number of distinct solutions for all S gets 
too large at each expansion due to lack of monotonicity, one can 
use simple practical heuristics. For example, restricted expansion 
moves can be limited to “monotone” subsets of pixels with either 
positive or negative unary potential with respect to 6. 

Pseudo-bound Bt(S,S) (73) could be useful for pairwise 
clustering without regularization when monotonicity is not needed 
to guarantee efficiency. Indeed, for iT = 2 a unary potential for 
each pixel in (73) changes its sign only once as parameter 6 
increases. It is enough to sort all critical values of S changing 
at least one pixel in order to traverse all (at most 1 + \Q\) distinct 
solutions for pseudo bound (73) in a linear time. For iT > 2 it is 
easy to check that the optimal label for each pixel changes at most 
K -1 times as 6 increases^^. Critical values of 6 for all pixels can 
be sorted so that all (at most l + (iT-l)|(7|) distinct solutions for 
pseudo bound (73) can be explored in a linear time. 

Trust region optimization: Non-monotonicity of the second 
(perturbation) term in (73) follows from the factor (1 - 2St) 
assigning unary potentials of different signs to pixels inside and 
outside the current segment St . This term can be represented as 


_ {i-2si^yws^ 




12. The segmentation functionals considered in [37] are completely different 
from the pairwise clustering energies we consider in this work 


13. The optimal value of a unary potential for each pixels is the lower 
envelope of K linear functions of 6, which has at most K - 1 breakpoints. 
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Fig. 12: Normalized cut (NC) objective (41). This proof-of- 
concept example shows that our pseudo-bound optimization (c) 
is less sensitive to local minima compared to the standard /cKM 
algorithm (bound optimizer) in (b). Pseudo-bound approach can 
also compete with spectral relaxation methods [8] in (d). 

where - S^\\w is a weighted Hamming distance between 
and the current segment Sf. Thus, instead of bounds, functions 
(73) and (74) can be treated as Lagrangians for the constrained 
optimization of the unary or higher-order approximations of the 
corresponding objectives over trust region [84] 

IIS'* - < T. 

In this case, parameter (5 is a Lagrange multiplier that can be 
adaptively changed from iteration to iteration [84], rather than 
exhaustively explored at each iteration. Note that it may still be 
useful to add a common monotone perturbation term as in [37] 

7?|S*| = 7?1'S* 

that can be efficiently explored at each iteration. This corresponds 
to a combination of pseudo-bound and trust region techniques. 

5 Parzen Analysis & Bandwidth Selection 

This section discusses connections of fcKM clustering to Parzen 
densities providing probabilistic interpretations for many equiv¬ 
alent pairwise clustering objectives discussed in our work. In 
particular, this section gives insights on practical selection of 
kernels or their bandwidth. We discuss extreme cases and analyze 
adaptive strategies. For simplicity, we mainly focus on Gaussian 
kernels, even though the analysis applies to other types of positive 
normalized kernels. 

Note that standard Parzen density estimate for the distribution 
of data points within segment can be expressed using normal¬ 
ized Gaussian kernels [87], [44] 

-PAMS'-) - (75) 

It is easy to see that /cKM energy (22) is exactly the following 
high-order Parzen density energy 

^(^) = -E E (76) 

k 


This probabilistic interpretation of /cKM gives an additional 
point of view for comparing it with pKM clustering with log- 
likelihood energy (14). Instead of parametric ML models /cKM 
uses Parzen density estimates. Another difference is absence of the 
log in (76). Omitting the log reduces the weight of low probability 
points, that is, outliers. In contrast, log-likelihoods in (14) are 
highly sensitive to outliers. To address this problem, pKM methods 
often use heuristics like mixing the desired probability model V 
with a uniform distribution, e.g. V(-\0) := e + (1 - e)V(-\0). 

5.1 Extreme bandwidth cases 

Parzen energy (76) is also useful for analyzing two extreme cases 
of kernel bandwidth: large kernels approaching the data range 
and small kernels approaching the data resolution. This section 
analyses these two extreme cases. 

Large bandwidth and basic K-means: Consider Gaussian 
kernels of large bandwidth a approaching the data range. In this 
case Gaussian kernels k in (75) can be approximated (up to a 

II / _J II2 

scalar) by Taylor expansion 1 - • Then, Parzen density 

energy (76) becomes (up to a constant) 

T^pqeS^ II “ ^<7 II 

t 2^\ 

which is proportional to the pairwise formulation for the basic K- 
means or variance criteria in Tab.l with Euclidean metric || ||. That 
is, kKM for large bandwidth Gaussian kernels reduces to the basic 
K-means in the original data space instead of the high-dimensional 
embedding space. 

In particular, this proves that as the bandwidth gets too large 
kKM looses its ability to find non-linear separation of the clusters. 
This also emphasizes the well-known bias of basic K-means to 
equal size clusters [27], [88]. 

Small bandwidth and Gini criterion: Very different prop¬ 
erties could be shown for the opposite extreme case of small 
bandwidth approaching data resolution. It is easy to approximate 
Parzen formulation of kKM energy (76) as 

F(5) S (77) 

k 

where VaiS^) is kernel-based density (75) and ds is a “true” 
continuous density for the sample of intensities {Ip\p e S^} in 
segment S^. Approximation (77) follows directly from the same 
Monte-Carlo estimation argument in Appendix C with the only 
difference being / = -Va(S^) instead of - log V(Os). 

If kernels have small bandwidth optimal for accurate Parzen 
density estimation^"^ we get Va(S^) ^ d^ further reducing (77) to 
approximation 

s -^|5 *i-<4,4) 

k 

that proves the following property. 

Property 1. Assume small bandwidth Gaussian kernels optimal 
for accurate Parzen density estimation. Then kernel K-means 
energy (76) can be approximated by the standard Gini criterion 
for clustering [35]: 

^g(5) := EI^TG(5*) (78) 

k 

14. Bandwidth near inter-point distances avoids density oversmoothing. 
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where G(S^) is the Gini impurity/6>r the data points in 

G(S^) := l-(4,4> = 1- r d^xfdx. (79) 

Jx 


Similarly to entropy, Gini impurity G(S^) can be viewed 
as a measure of sparsity or “peakedness” for continu¬ 
ous or discrete distributions. Both Gini and entropy cluster¬ 
ing criteria are widely used for decision trees [35], [36]. 
In this discrete context Breiman [35] analyzed theoretical 
properties of Gini criterion (78) for the 
case of histograms Vh where G{S^) = 
1 “ Ecc He proved that for 

K = 2 thQ minimum of the Gini cri¬ 
terion is achieved by sending all data 
points within the highest-probability 
bin to one cluster and the remaining 
data points to the other cluster, see the 
color encoded illustration above. We extend Brieman’s result to 
the continuous Gini criterion (78)-(79). 




Theorem 5 (Gini Bias, K = 2). Let be a continuous proba¬ 
bility density function over domain Vt ^ defining conditional 
density ds(x) := dQ(x\x e S^) for any non-empty subset c (2. 
Then, continuous version of Gini clustering criterion (78) achieves 
its optimal value at the partitioning of Ll into regions and 
= fl \ such that 


= argmax(iQ(x). 

X 

Proof See Appendix E and Proposition 2. □ 

The bias to small dense clusters is practically noticeable for 
small bandwidth kernels, see Fig. 14(d). Similar empirical bias 
to tight clusters was also observed in the context of average 
association (22) in [8]. As kernel gets wider the continuous Parzen 
density (75) no longer approximates the true distribution ds and 
Gini criterion (78) is no longer valid as an approximation for 
/cKM energy (76). In pratice, Gini bias gradually disappears as 
bandwidth gets wider. This also agrees with the observations 
for wider kernel in average association [8]. As discussed earlier, 
in the opposite extreme case when bandwidth get very large 
(approaching data range) kKM converges to basic K-means or 
variance criterion, which has very different properties. Thus, 
kernel K-means properties strongly depend on the bandwidth. 

The extreme cases for kernel K-means, i.e. Gini and variance 
criteria, are useful to know when selecting kernels. Variance 
criteria for clustering has bias to equal cardinality segments [27], 
[88]. In contrast, Gini criteria has bias to small dense clusters 
(Theorem 5). To avoid these biases kernel K-means should use 
kernels of width that is neither too small nor too large. Our 
experiments compare different strategies with fixed and adaptive- 
width kernels (Sec. 5. 2). Equivalence of kernel-K-means to many 
standard clustering criteria such as average distortion, average as¬ 
sociation, normalized cuts, ^tc(see Sec. 1.2.2) also suggest kernel 
selection strategies based on practices in the prior literature. 


5.2 Adaptive kernels via Nash embedding and KNN 

As discussed in Sec. 5.1, kernel width should neither be too 
small nor too large. We propose adaptive kernels designed to 
equalize the density in highly dense regions in the color space. 


space of points / 
with Riemannian metric 



unit balls in Riemannian metric defined by tensor H 


transformed points I' 
with Euclidian metric 



Fig. 13: Nash embedding: adaptive non-normalized Gaussian ker¬ 
nels define isometric transformation of the color space modifying 
density. Ellipsoids are mapped to balls. 


The following equation interprets adaptive Gaussian kernels via 
Riemannian distances in the color space (left picture in Fig. 13) 

~\\dp~Iq 11 ^ 

fcp(4,/,) = e 

= e 2 , 


According to Nash embedding theorem [89], this Riemannian 
color space can be isometrically embedded into a Euclidean space, 
so that the last expression above is equal to 



Hlpj'g) 


where k is a fixed-width Gaussian kernel in the new transformed 
space (right picture in Fig. 13). Thus, non-normalized Gaussian 
kernels of adaptive width cfp (or covariance matrix Tjp, in general) 
define some color space transformation, Nash embedding, that 
locally stretches or shrinks the space. After this transformation, 
clustering is done using a fixed (homogeneous) Gaussian kernel 
of constant width. 

Figure 13 helps to illustrate how Nash embedding changes the 
color space density. The number of points in a unit (Euclidean) ball 
neighborhood in the transformed space is equal to the number of 
points in the corresponding unit (Riemannian) ball in the original 
space: 

K = d'-Vi^d-Va 


where d and d! are local densities in the original and transformed 
spaces. Thus, kernel width Gp can be selected adaptively based on 
any desired transformation of density d!(d) according to formula 


d'(dp) 


\ 


(80) 


where dp := d(Ip) is an observed local density for points in the 
color space. This local density can be evaluated using any common 
estimator, e.g. Parzen approach gives 


d{Ip) 



(81) 


where Aq could be adaptive or fixed Aq = const, according to 
any standard technique for density estimation [87]. 

To address Breiman bias one can use density equalizing 
transforms d'{d) = const or d' = ^log(l + ad), which even 
up the highly dense parts of the color space. Formula (80) works 
for any target transform d\d). Once adaptive kernels Gp are 
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Using/fxed width kernei 


Using adaptive width kernei 



Original density 
{e) D^n$ily mapping 



{a) Input image 


0 

{S>) 2D Color ipaw hr^tagram 




{c) Kernel Clusterirjg resuit (d} OBJ (BKG) pixels in red (blue) 


{f) Kerrei Clustering result 


Fig. 14: (a)-(d): Breiman bias for fixed (small) kernel, (e) Em¬ 
pirical density transform for adaptive kernels (80) with d'(d) = 
const. Color space density equalization counters Breiman bias, 
(f) Segmentation for adaptive kernels. 



chosen, Nash theory also allows to obtain empirical scatter plots 
d'(d)Q := {(d'(Ip), d{Ip))\p e f]}, for example, to compare it 
with the selected “theoretical” plot d'(d). Estimates d(Ip) are 
provided in (81) and the density estimate for Nash embedding are 


d'O'r,) 



(82) 


Note the difference between empirical density estimates for d' in 
(82) and d in (81); the former uses the sum of non-normalized 
kernels of selected adaptive width (jq in (80) and the latter is 
the sum of normalized kernels of width Ag based on chosen 
density estimator. While parameter (Tq directly controls the density 
transformation, A plays a fairly minor role concerning the quality 
of estimating density d. 

Figure 14(e) illustrates the empirical density mapping d'(d)Q 
induced by adaptive kernels (80) for d'(d) = const. Notice a 
density-equalization effect within high density areas in the color 
space addressing the Breiman bias. 

The const density mapping can be approximated using KNN 
graph. To be specific, the symmetric KNN kernel in this paper is 
defined as follows: 


kpg = k{U,U) = [U^KNNU,)] 

+ [/, £ KNNiU)] (83) 

where KNN(fp) is a set of K nearest neighbors of fp. The 
affinity between fp and fq achieves maximum value of 2 if they 
are mutually each other’s KNNs. 


6 Experiments 

This section is divided into two parts. The first part (Sec. 6.1) 
shows the benefits of extra MRF regularization for kernel & 
spectral clustering, e.g. normalized cut. We consider pairwise 
Potts, label cost and robust bin consistency term, as discussed in 
Sec. 1.1. We compare to spectral clustering [8], [66] and kernel 
K-means [40], which can be seen as degenerated versions for 
spectral and kernel cuts (respectively) without MRF terms. We 
show that MRF helps kernel & spectral clustering in segmentation 
and image clustering. In the second part (Sec. 6. 2) we replace the 


log-likelihoods in model-fitting methods, e.g. GrabCut [26], by 
pairwise clustering term, e.g. A A and NC. This is particularly ad¬ 
vantageous for high dimension features (location, depth, motion). 

Implementation details: For segmentation, our kernel cut 
uses KNN kernel (83) for pixel features Ip, which can be 
concatenation of FAB (color), XY (location) and M (motion or 
optical flow) [90]. We choose 400 neighbors and randomly sample 
50 neighbors for each pixel. Sampling does not degrade our 
segmentation but expedites bound evaluation. We also experiment 
with popular mPb contour based affinities [67] for segmentation. 
The window radius is set to 5 pixels. 

For contrast-sensitive regularization, we use standard penalty 
^ ■^e~^'^^dp-iq\\ 2 /p for ( 2 ), where 77 is the average of \\Ip - 

Iq\\‘^ over a 8-connected neighborhood and dpq is the distance 
between pixels p and q in the image plane. We set Wpq = for 
length regularization. 

For GrabCut, we used histogram-based probability model, as 
is common in the literature [85], [91]. We tried various bin size 
for spatial and depth channels. 

With fixed width Gaussian kernel used in Fig. 14, the time 
complexity of the naive implementation of kernel bound eval¬ 
uation in (55) is 0(|(7|^). The bottleneck is the evaluation of 
JCXt and XflCXt in derivative Ve(Xt) (52). In this case, we 
resort to fast approximate dense filtering method in [92], which 
takes 0(1^1) time. Also notice that the time complexity of the 
approach in [92] grows exponentially with data dimension N. A 
better approach for high-dimensional dense filtering is proposed 
in [93], which is of time 0(|0| x A). We stick to [92] for low¬ 
dimensional color space in experiments in Fig. 14. 

6.1 MRF helps Kernel & Spectral Clustering 

Here we add MRF regulation terms to typical normalized cut 
applications, such as unsupervised multi-label segmentation [67] 
and image clustering [94]. Our kernel and spectral cuts are used to 
optimize the joint energy of normalized cut and MRF (1) or (68). 

6.1.1 Normalized Cut with Potts Regularization 

Spectral clustering [8] typically solves a (generalized) eigen prob¬ 
lem, followed by simple clustering method such as K-means on 
the eigenvectors. However, it is known that such paradigm results 
in undesirable segmentation in large uniform regions [67], [66], 
see examples in Fig. 15. Obviously such edge mis-alignment can 
be penalized by contrast-sensitive Potts term. Our spectral and 
kernel cuts get better segmentation boundaries. As is in [40] we 
use spectral initialization. 

Tab.5 gives quantitative results on BSDS500 dataset. Number 
of segments in ground truth is provided to each method. It 
shows that kernel and spectral cuts give better covering, PRI 
(probabilistic rand index) and VOI (variation of information) than 
spectral clustering. Fig. 15 gives sample results. Kernel K-means 
[40] gives results similar to spectral clustering and hence are not 
shown. 

6.1.2 Normalized Cuts with Label Cost [ 12] 

Unlike spectral clustering, our kernel and spectral cuts do not need 
the number of segments beforehand. We use kernel cut to optimize 
a combination of the normalized cut, Potts model and label costs 
terms. The label cost (4) penalizes each label by constant h^. 
The energy is minimized by ^-expansion and Q(/5-swap moves 
in Sec. 2.3. We sample initial models from patches, as in [12]. 
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Fig. 15: Sample results on BSDS500. Top row: spectral clustering. Middle & Bottom rows: our Kernel & Spectral Cuts. 


method 

Covering 

PRI 

VOI 

Spectral Clustering 

0.34 

0.76 

2.76 

Our Kernel Cut 

0.41 

0.78 

2.44 

Our Spectral Cut 

0.42 

0.78 

2.34 


TABLE 5: Results of spectral clustering (K-means on eigenvec¬ 
tors) and our Kernel Cut & Spectral Cuts on BSDS500 dataset. 
For this experiment mPb-based kernel is used [67]. 

Results with different label cost are shown in Fig. 16. Due to 
sparsity prior, our kernel and spectral cuts automatically prune 
weak models and determine the number of segments, yet yield 
regularized segmentation. We use KNN affinity for normalized 
cut and mPb [67] based Potts regularization. 

6.1.3 Normalized Cut with High Order Consistency Term 
[11], [13], [14] 

It is common that images come with multiple tags, such as those 
in Flickr platform or the LabelMe dataset [95]. We study how to 
utilize tag-based group prior for image clustering [94]. 

We experiment on the LabelMe dataset [95] which contains 
2,600 images of 8 scene categories (coast, mountain, forest, open 
country, street, inside city, tall buildings and highways). We use 
the same GIST feature, affinity matrix and group prior as used 
in [94]. We found the group prior to be noisy. The dominant 
category in each group occupies only 60%-90% of the group. 
The high-order consistency term is defined on each group. For 
each group, we introduce an energy term that is akin to the robust 

-Potts [11], which can be exactly minimized within a single 
a/3-swap or ^-expansion move. Notice that here we have to use 
robust consistency potential instead of rigid ones. 

Our kernel cut minimizes NC plus the robust P’^-Potts term. 
Spectral cut minimizes energy of (67). Normalized mutual infor¬ 
mation (NMI) is used as the measure of clustering quality. Perfect 
clustering with respect to ground truth has NMI value of 1. 

Spectral clustering and kernel K-means [40] give NMI value 
of 0.542 and 0.572 respectively. Our kernel cut and spectral cut 
significantly boost the NMI to 0.683 and 0.681. Fig. 17 shows 
the results with respect to different amount of image tags used. 
The left most points correspond to the case when no group prior 
is given. We optimize over the weight of high order consistency 


term, see Fig. 17. Note that it’s not the case the larger the weight 
the better since the grouping prior is noisy. 

We also utilize deep features, which are 4096 dimensional 
fc7 layer from AlexNet [96]. We either run plain K-means, or 
construct a KNN kernel on deep features. These algorithms are 
denoted as deep K-means, deep spectral cut or deep kernel cut 
in Fig. 17. Incorporating group prior indeed improved clustering. 
The best NMI of 0.83 is achieved by our kernel cut and spectral 
cut for KNN kernel on deep features. 

6.2 Kernel & Spectral Clustering helps MRF 

In typical MRF applications we replace the log-likelihood terms 
by average association or normalized cut. We evaluate our Kernel 
Cut (fixed width kernel or KNN) in the context of interactive 
segmentation, and compare with the commonly used GrabCut 
algorithm [26]. In Sec. 6.2.1, we show that our kernel cut is less 
sensitive to choice of regularization weight 7. We further report 
results on the GrabCut dataset of 50 images and the Berkeley 
dataset in Sec. 6.2.2. We experiment with both (i) contrast- 
sensitive edge regularization, (ii) length regularization and (iii) 
color clustering (i.e., no regularization) so as to assess to what 
extent the algorithms benefit from regularization. 

From Sec. 6.2.3 to Sec. 6.2.6, we also report segmentation re¬ 
sults of our kernel cut with high-dimensional features Ip, including 
location, texture, depth, and motion respectively. 

6.2.1 Robustness to regularization weight 

We first run all algorithms without smoothness. Then, we experi¬ 
ment with several values of 7 for the contrast-sensitive edge term. 
In the experiments of Fig. 18 (a) and (b), we used the yellow boxes 
as initialization. For a clear interpretation of the results, we did not 
use any additional hard constraint. In Fig. 18, ”KemelCut-KNN- 
AA” means Kernel Cut with KNN kernel for average association 
(AA). Without smoothness, our Kernel Cut yields much better 
results than Grab Cut. Regularization significantly benefited the 
latter, as the decreasing blue curve in (a) indicates. For instance, 
in the case of the zebra image, model fitting yields a plausible 
segmentation when assisted with a strong regularization. However, 
in the presence of noisy edges and clutter, as is the case of the 
chair image in (b), regularization does not help as much. Note 
that for small regularization weights 7 our method is substantially 
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boundary 

smoothness 

GrabCut 

color clustering term 

KernelCut KernelCut 
-Gau-AA -Gau-NC 

KernelCut 

-KNN-AA 

none 

27.2 

20.4 

17.6 

12.2 

Euclidean length 

13.6 

15.1 

16.0 

10.2 

contrast-sensitive 

8.2 

9.7 

13.8 

7.1 


TABLE 6: Box-based interactive segmentation (Fig.21). Error 
rates (%) are averaged over 50 images in GrabCut dataset. 
KemelCut-Gau-NC means KernelCut for fixed width Gaussian 
kernel based normalized cut objective. 


better than model fitting. Also, the performance of our method is 
less dependent on regularization weight and does not require fine 
tuning of 7. 

6.2.2 Segmentation on GrabCut & Berkeley datasets. 

First, we report results on the GrabCut database (50 images) using 
the bounding boxes provided in [97]. For each image the error 
is the percentage of mis-labeled pixels. We compute the average 
error over the dataset. 

We experiment with four variants of our Kernel Cut, depending 
on whether to use fixed width Gaussian kernel or KNN kernel, and 
also the choice of normalized cut or average association term. We 
test different smoothness weights and plot the error curves^^ in 
Fig. 19. Table 6 reports the best error for each method. For contrast- 
sensitive regularization GrabCut gets good results (8.2%). How¬ 
ever, without edges (Euclidean or no regularization) GrabCut gives 
much higher errors (13.6% and 27.2%). In contrast, KemelCut- 
KNN-AA (Kernel Cut with adaptive KNN kernel for A A) gets 
only 12.2% doing a better job in color clustering without any help 
from the edges. In case of contrast-sensitive regularization, our 
method outperformed GrabCut (7.1% vs. 8.2%) but both methods 
benefit from strong edges in the GrabCut dataset. Fig .20 shows 
that our Kernel Cut is also robust to the hyper-parameter, i.e. iT 
for nearest neighbours, unlike GrabCut. 

Figure 21 gives some sample results. The top row shows a 
failure case for GrabCut where the solution aligns with strong 
edges. The second row shows a challenging image where our 
KernelCut-KNN-A A works well. The third and fourth rows show 

15. The smoothness weights for different energies are not directly compara¬ 
ble; Fig. 19 shows all the curves for better visualization. 



Fig. 16: Segmentation using our kernel cut with label cost. We 
experiment with increasing value of label cost for each label 
(from left to right) 


boundary smoothness 

color clustering term 

r- K/- ^ KernelCut 
BJ GrabCut .^NN-AA 

none 

12.4 

12.4 

7.6 

contrast-sensitive 

3.2 

3.7 

2.8 


TABLE 7: Seeds-based interactive segmentation (Fig.22). Error 
rates (%) are averaged over 82 images from Berkeley database. 
Methods get the same seeds entered by four users. We removed 
18 images with multiple nearly-identical objects (see Fig. 24) from 
100 image subset in [98]. (GrabCut and KernelCut-KNN-A A give 
3.8 and 3.0 errors on the whole database.) 


failure cases for fixed-width Gaussian kernel in kernel cut due to 
Breiman’s bias (Th.5) where image segments of uniform color are 
separated; see green bush and black suit. Adaptive kernel (KNN) 
addresses this bias. 

We also tested seeds-based segmentation on a different 
database [98] with ground truth, see Tab.7 and Fig.22. 

6.2.3 Segmentation of similar appearance objects 
Even though objects may have similar appearances or look similar 
to the background (e.g. the top row in Fig. 24), we assume that 
the objects of interest are compact and have different locations. 
This assumption motivates using XY coordinates of pixels as 
extra features for distinguishing similar or camoufiaged objects. 
XY features have also been used in [99] to build space-variant 
color distribution. However, such distribution used in MRF-MAP 
inference [99] would still over-fit the data [41]. Let Ip e IZ^ be 
the augmented color-location features Ip = [Ip^ap^bp^ flxp^ fli/p] 
at pixel p where [Ip^ap^bp] is its color, [xp^pp] are its image 
coordinates, and /3 is a scaling parameter. Note that the edge-based 
Potts model [4] also uses the XY information. Location features in 
the clustering and regularization terms have complementary effect: 
the former solves appearance camouflage while the latter gets edge 
alignment. 

We test the effect of adding XY into feature space for GrabCut 
and Kernel Cut. We try various p for Kernel Cut. Fig. 23 shows 
the effect of different p on KNNs of a pixel. For histogram-based 
GrabCut we change spatial bin size for the XY channel, ranging 
from 30 pixels to the image size. We report quantitative results on 
18 images with similar objects and camoufiage from the Berkeley 
database [100]. Seeds are used here. Fig. 25 shows average errors 
for multi-object dataset, see example segmentations in Fig. 24. 

Fig. 26 gives multi-label segmentation of similar objects in one 
image with seeds using our algorithm. We optimize kernel bound 
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Fig. 17: Incorporating group prior achieves better NMI for image 
clustering. Here we use tags-based group prior. Our method 
achieved better NMI when more images are tagged. The right plot 
shows how the weight of bin consistency term affects our method. 
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Fig. 18: Illustration of robustness to smoothness weight. 


with move-making for NC and smoothness term combination, as 
discussed in Sec. 2.2. Fig. 26 (c) shows energy convergence. 

6.2.4 Texture segmentation 

The goal of this experiment is to demonstrate scalability of 
our methods to highly dimensional data. First, desaturated im¬ 
ages from GrabCut database [26] are convolved with 48 filters 
from [101]. This yields a 48-dimensional descriptor for each 
pixel. Secondly, these descriptors are clustered into 32 textons 
by K-means. Thirdly, for each pixel we build a 32-dimensional 
normalized histogram of textons in 5 x 5 vicinity of the pixel. 
Then the gray-scale intensity^^ of a pixel is augmented by the 
corresponding texton histogram scaled by a factor w. Finally, re¬ 
sulting 33-dimensional feature vectors are used for segmentation. 
We show the result of Kernel Cut with respect to w in Fig. 27. 

16. We found that for the GrabCut database adding texture features to RGB 
does not improve the results. 




Fig. 19: Average error vs. regularization weights for different 
variants of our KernelCut on the GrabCut dataset. 


Number of Bins 

16 32 64 128 256 



Fig. 20: Our method aKKM is robust to choice of K while 
GrabCut is sensitive to bin size for histograms. 


We compare our results with GrabCut with various bin sizes for 
texture features. 

6.2.5 Interactive RGBD Images Segmentation 

Depth sensor are widely used in vision for 3D modelling [103], 
[104], semantic segmentation [105], [106], [102], [107], motion 
flow [108]. We selected 64 indoor RGBD images from seman¬ 
tic segmentation database NYUv2 [102] and provided bounding 
boxes and ground truth. In contrast to [26], the prepared dataset 
consists of low-quality images: there are camera motion artifacts, 
underexposed and overexposed regions. Such artifacts make color- 
based segmentation harder. 
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Fig. 28: RGBD+XY examples. The first two rows show original images wit bounding box and color-coded depth channel. The third 
row shows the results of Grabcut, the forth row shows the results of Kernel Cut. Parameters of the methods were independently selected 
to minimize average error rate over the database. The parameters of the algorithms were selected to minimize the average error over 
the dataset. 
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Fig. 21: Sample results for GrabCut and our kernel cut with fixed 
width Gaussian or adaptive width KNN kernel, see Tab. 6. 

We compare GrabCut to Kernel Cut over joint features Ip = 
[I/p, ap^bp^ PDp] as in Sec. 6.2.3. Fig. 28 shows the error statistics 
and segmentation examples. While Kernel Cut takes advantage of 
the additional channel, GrabCut fails to improve. 

6.2.6 Motion segmentation 

Besides the location and depth features, we also test segmentation 
with motion features. Figs. 30, 31 and 32 compare motion seg¬ 
mentations using different feature spaces: RGB, XY, M (optical 
flow) and their combinations (RGBM or RGBXY or RGBXYM). 
Abbreviation +XY means Potts regularization. We apply kernel 
cut (Alg.l) to the combination of NC with the Potts term. 


seeds BJ GrabCut aKKM 



Fig. 22: Sample results for BJ [4], GrabCut [26], and our kernel 
cut for adaptive KNN kernel, see Tab.7. 


Fig. 23: Visualization of a pixel’s K-Nearest-Neighbours for RGB 
feature (left) or RGBXY feature (right). 

Challenging video examples: For videos in FBMS-59 dataset 
[109], our algorithm runs on individual frames instead of 3D 
volume. Segmentation of previous frame initializes the next frame. 
The strokes are provided only for the first frame. We use the optical 
flow algorithm in [90] to generate M features. Selected frames are 
shown in Figs. 30 and 31. Instead of tracks from all frames in 
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(a) seeds (b) ground truth (c) GrabCut (d)Kemel Cut 
Fig. 24: Sample results using RGBXY+XY. 


Bin size in pixels 



Fig. 25: Error on Multi-objects dataset. We vary spatial bin-size 
for GrabCut and weight p in [/, a, 6,for Kernel Cut. 
The connection range is the average geometric distance between 
a pixel and its nearest neighbor. The right-most point of the 
curves corresponds to the absence of XY features. GrabCut 
does not benefit from XY features. Kernel Cut achieves the best 
error rate of 2.9% with connection range of 50 pixels. 


[110], our segmentation of each frame uses only motion estimation 
between two consecutive frames. Our approach jointly optimizes 
normalized cut and Potts model. In contrast, [110] first clusters 
semi-dense tracks via spectral clustering [109] and then obtains 
dense segmentation via regularization. 

Kitti segmentation example: We also experiment with Kitti 
dataset [111]. Fig. 32 shows the multi-label segmentation using ei¬ 
ther color information RGB-i-XY (first row) or motion MXY-i-XY 
(second row). The ground-truth motion field works as M channel. 
Note that the motion field is known only for approximately 20% of 
the pixels. To build an affinity graph, we construct a KNN graph 
from pixels that have motion information. The regularization over 
8-neighborhood on the pixel grid interpolates the segmentation 
labels during the optimization procedure. 


APPENDIX A (Weighted KM and AA) 


This Appendix reviews weighted kernel K-means (wkKM) in 
detail. In particular, it describes generalizations of KM procedures 
(23) and (24) to the weighted case and shows that they also cor¬ 
respond to linear bound optimization by extending Theorem 1 in 
Sec. 2.1. We provide an alternative derivation for the kernel bound 
for NC. The Appendix also discusses equivalence of weighted A A 
with arbitrary affinity to wkKM with p.s.d. kernel and explains the 
corresponding diagonal shift. 
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(c) Energy minimization for NC plus smoothness 
Eig. 26: Multi-label segmentation for similar objects. 


Number of bins in extra dimensions 

2 4 6 8 10 



Eig. 27: The average errors of GrabCut a Kernel Cut methods for 
texture segmentation over 50 desaturated images from GrabCut 
database [26]. We optimize the result of GrabCut with respect 
to smoothness weight and bin sizes in the intensity dimension. 
We optimize the result of Kernel Cut with respect to smoothness 
weight. 


Weighted kernel K-means (wkKM): As discussed in Sec¬ 
tion 1.2.2, weighted K-means corresponds to objective (36) 


F^{S) = 


E E wp\\(i)p - usk 

k 


-E 


S^'WJCWS^ 

S^'w 


(A-1) 

(A-2) 


where ||.|| is the Euclidean norm, w := {wp\p e are predefined 
weights, (pp = (p(Ip) is an embedding of data points in some high¬ 
dimensional space, and is a weighted cluster mean (33). Con¬ 
sistently with Sec. 1.2.2 we use diagonal matrix W = diag(w) 
and embedding matrix p \= [^pp] implying identities (34) and 
matrix formulation (A-2) with p.s.d. kernel 


K = p'p 


of dot products ICpq = p'ppq. The constant connecting equivalent 
objectives (A-1) and (A-2) is EpeO 
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(a) frames (b) optical flow [90] (c) M+XY (d) RGB+XY (e) RGBM+XY 

Fig. 30: Motion segmentation using our framework for the sequence horsesOl in FBMS-59 dataset [109]. Motion feature alone (M+XY 
in (c)) is not sufficient to obtain fine segmentation. Our framework successfully utilize motion feature (optical flow) to separate the 
horse from the bam, which have similar appearances. 
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(a) frames (b) optical fiow [90] (c) RGBXY+XY (d) RGBXYM+XY 


Fig. 31: Multi-label motion segmentation using our framework for the sequence ducksOl in FBMS-59 dataset [109]. This video is 
challenging since the ducks here have similar appearances and even spatially overlap with each other. However, different ducks come 
with different motions, which helps our framework to better separate individual ducks. 
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Fig. 32: Motion segmentation for image 000079_10 from KlTTl [111] dataset. The first row shows the motion fiow. Black color codes 
the pixels that do not have motion information. The second row shows color-based segmentation. The third row shows motion based 
segmentation with location features. We also tried M-i-XY segmentation, but it does not work as well as MXY-i-XY above. The results 
for RGBMXY-i-XY were not significantly different from MXY-i-XY. 


Number of bins in depth dimension 
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Fig. 29: The average errors of GrabCut a Kernel Cut methods over 
64 images selected from NYUv2 database [102]. 


In the context of weighted energy (A-1) the basic KM algo¬ 
rithm [28] is the block-coordinate descent for mixed objective (32) 

F^{S,m) ■■= Wp\\(j)p-mkf 

k 

= -2T^</>'TOfe) s’" (A-3) 

k 

where the second linear algebraic formulation^^ generalizes (49) 
and highlights modularity (linearity) with respect to S. Variables 
rrik can be seen as “relaxed” segment means in (A-1). 
Yet, energies (A-3) and (A-1) are equivalent since their global 
minimum is achieved at the same optimal segmentation S. Indeed, 

IJ-Sk = argmin Y, u>p\\(l)p - nikf 

=> = minF"’(5,TO). (A-4) 

m 

Weighted KM and bound optimization: The weighted case 
of procedure (23) replaces ji^k (17) by weighted mean (33) 

(T-ST) S- - (A-5, 

17. It is obtained by opening the square of the norm and applying algebraic 
identities (34). Formulation (6) omits the same constant as (A-2). 
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Assuming <pp e 7V^ each iteration’s complexity 0(\Q\Km) is 
linear with respect to the number of data points \Q\. 

Implicit /cKM procedure (24) generalizes to the weighted case 
as follows. Similarly to Sec.8.2.2 in [51] and our derivation of (24) 
in Sec. 1.2.2, the square of the objective in (A-5) and (33) give 




^ cpp'cpWS^ ^ SfW(f>'(f>WS^ 

Sfw 


Since the crossed term is a constant at p, the right hand side gives 
an equivalent objective for computing Sp in (A-5). Using JC = 
and indicator vector Ip for element p we get generalization of (24) 


Pp'KWS^ _ 
S^'w 
(A-6) 

In contrast to procedure (A-5), this approach has iterations of 
quadratic complexity 0{\Qf). However, it avoids the explicit use 
of high-dimensional embeddings <pp replacing them by the kernel 
matrix JC in all computations, a.k.a. the Jcernel trick. 

Generalizing Theorem 1 in Sec. 2.1 we can interpret weighted 
KM procedures (A-5,A-6) as linear bound optimization. 


/ weighted \ 
/cKM 
\ procedure / 




s. 


arg mm - 
k 


s^'wicwsj 
(sfwr 


Theorem 6 (bound for (35,A-1)). Weighted KM procedures (A-5) 
or (A-6) can be seen as bound optimization methods for weighted 
K-means objective F'^(S) (35) or (A-1) using auxiliary function 


at{S) (A-7) 


Then, (A-9) implies the following linear bound for NC 



that agrees with the kernel bound for NC in Table 2. 

Equivalence of weighted AA and AD to wkKM: Figure 33 
extends Figure 3 by relating weighted generalizations of standard 
pairwise clustering objectives to wkKM. The equivalence of 
objectives in Figure 33 can be verified by simple algebra. One 
additional simple property below is also needed. It is easy to prove. 

Proposition 1. (e,g, Roth et al. [38]) For any symmetric matrix 
M define 

M:=M + 6I 

where I is an identity matrix. Then, matrix M is positive semi- 
definite (psd) for any scalar S > -Ao(M) where Aq is the smallest 
eigenvalue of its argument. 

APPENDIX B (Weak kernel K-means) 

For Hilbertian distortions ||||^ = ||||^ with p.s.d. kernels we can 
show that pairwise kKM approach (21) is “stronger” than a 
pointwise pKM approach (13) using the same metric. In this case 
pKM can be called weak kernel K-means, see Figure 34. Equiv¬ 
alent /cKM formulation (18) guarantees more complex decision 
boundaries, see Fig. 1(h), compared to pKM energy (13) 


at any current segmentation St = {Sf} with means pf = {p^k}- 

Proof. Indeed, (A-4) implies at{S) > F^{S). Since at {St) = 
F'^{St) then at{S) is a proper bound for F'^{S). Re¬ 
segmentation step (A-5) produces optimal segments St+i 
minimizing the bound at{S). Re-centering step minimizing 
F'^(St+i^m) for fixed segments produces means pf+i defining 
the bound at+i (S) for the next iteration. □ 


Since algebraic formulations (A-3) and (A-2) omit the same 
constant we also get the following Corollary. 


Corollary 1 (bound for (3 6, A-2)). Weighted KM procedures 
(A-5) or (A-6) can be seen as bound optimization methods for 
wkKM objective F'^{S) in (A-2) using auxiliary function 

at{S) := S’^ (A-8) 


- E 

k 




/ S^WICW 

\ (w'S’^y 


^ 2 y 

w'Sjj j 


(A-9) 


at any current segmentation St •= {5'^} with means pf := {p^k }• 


Proof. The first expression follows from Th. 6 and formula (A-3) 
for F^{S,m) at m = pf. Also, (33) and JC = imply the 
second expression for bound at{S). □ 


The second expression for at{S) in Corollary 1 allows to 
obtain a linear bound for NC objective (41). For simplicity, assume 
positive definite affinity A. As follows from [52], [40], [58] and 
a simple derivation in our Sec. 1.3.1, normalized cut (41) with 
p.d. affinity A is equivalent to wkKM (36, A-2) with weights and 
kernel in (42) 

IC = W~^AW~^. 


^ ^ \\Ip-mk\\l (B-1) 

k psS^ 

^ E E 

k peS^ 

with isometric kernel distance \\\\j^ and some point ruj^ in the 
original space. Fig. 1(g). Indeed, any rrik in the original space 
corresponds to some search point (j){mk) in the high-dimensional 
embedding space, while the opposite is false. Thus, optimization 
of (21) and (18) has larger search space than (B-1). It is also easy 
to check that energy (B-1) is an upper bound for (21) at any S. For 
this reason we refer to distortion energy (B-1) with kernel distance 
IIII/e and explicit mj^ in the original space as weak kernel K-means. 
Pointwise energy (B-1) is an example of pKM (13), while pairwise 
energy (21) with the same kernel metric is kKM. 

Note that weak kernel K-means (B-1) for Gaussian kernel 
corresponds to K-modes closely related to mean-shift [114], [39], 
as discussed below. Some results comparing K-modes (weak 
kKM) to regular kKM are shown in Fig.l(g,h). Figure 34 illustrates 
general relations between kernel K-means (21) and probabilistic 
K-means (13,14). It includes a few examples discussed earlier and 
some more examples discussed below. 

K-modes and mean-shift: Weak kernel K-means using unary 
formulation (B-1) with kernel distance || \\k and explicit optimiza¬ 
tion of rrik in the original data space is closely related to K-modes 
approach to clustering continuous [30] or discrete [115], [116] 
data. For example, for Gaussian kernel distance || \\k energy (B-1) 
becomes 



k p^S^ 


or, using Parzen densities Va{ ■ I^S^) for points {Ip\p ^ S^}, 
k 


W = d:= Al 


and 


(B-2) 
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NOTATION: 

(point weigh 
w := {wp} r:={l/wp} (vectors) 

W := diag{w) (diagonal matrix) 

r' r 1 1 1 

(matrix) 
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2Wq^ 


average _ v wpWgApg 
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diag{W) = W-^ 
did^{W) := W-diag{W) 
= W-W-^ 


kernel 
K-means 

Y. wpUp- 

k pes^ 

embedding 0 in Hilbert space 



NOTE : c.p.d. A will give 
non-negative distortions Dpq 
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T^pq^Sk WpWqD. 


pq 


- E 

k 

= -E 


2{w, S^) 


2 {w,S>‘) 

{w, S’^) 

correspond to Hilbertian metric and p.d. kernel 
in the original data space 


pair-wise distortions Dpq 

assuming zero diagonal Dpp = 0 
but symmetry and non-negativity are not needed 
[Roth, PAMI’03] 


Fig. 33: Equivalence of kernel K-means (/cKM), average distortion (AD), average association (AA) for the general case of weighted 
points. Typically, kKM is associated with positive semi-definite (p.s.d.) or conditionally positive-definite (c.p.d.) kernels [112] and 
Hilbertian distortions [47]. The above formulations of AA and AD make no assumptions for association matrix A and distortions D 
except for zero diagonal in D. Then, equivalence of AD and AA objectives (up to a constant) is straightforward. Roth et al. [38] 
reduce non-weighted case of AD to kKM. For arbitrary D they derive Hilbertian distortion || ||^ with an equivalent AD energy (up to 
a constant) and explicitly construct the corresponding embedding fi. We show Hilbertian metric || ||^ for the general weighted case of 
AD, see AD^/cKM above. Dhillon et al. [40], [58] prove that normalized cuts is a special case of weighted /cKM and construct the 
corresponding p.d. kernel. Sec. 1.3.1 shows a simpler reduction of normalized cuts to weighted /cKM. Similarly to [38], an equivalent 
p.d. kernel could be constructed for any association matrix A, see AA^/cKM above. Note that the formulas for ^d-equivalent p.d. kernel 
and 17-equivalent Hilbertian metric require some sufficiently large diagonal shift 5. Roth et al. [38] relate proper S to the smallest 
eigenvalue of A = - y. Our weighted formulation requires the eigenvalue of diag(yV)~^ • A • diag(W)~^. 
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Fig. 34: Clustering with (A) pointwise and (B) pairwise distortion energies generalizing (13) and (26) for points with weights w = {wp}. 
Pointwise distortion relates a data point and a model as log-likelihood function ||/p - ^||^ = - In 7^(Ip|^). Pairwise distortion is defined 
by matrix Dpq = \\Ip- Iq \\d- Weighted AD or AA for arbitrary metrics are equivalent to weighted /cKM, see Figures 3 and 33. As shown 
in [38], [40] average cut, normalized cut [8], and spectral ratio cut [113] are examples of (weighted) /cKM. 
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Clearly, optimal rrik are modes of Parzen densities in each 
segment. K-modes objective (B-2) can be seen as an energy- 
based formulation of mean-shift clustering [114], [39] with a 
fixed number of clusters. Formal objective allows to combine 
color clustering via (B-2) with geometric regularization in the 
image domain [29]. If needed, the number of clusters (modes) can 
be regularized by adding label cost [12]. In contrast, mean-shift 
segmentation [39] clusters RGBXY space combining color and 
spatial information. The number of clusters is indirectly controlled 
by the bandwidth. 

Note that K-modes energy (B-2) follows from a weak /cKM 
approach (B-1) for arbitrary positive normalized kernels. Such 
kernels define different Parzen densities, but they all lead to energy 
(B-2) where optimal rrik are modes of the corresponding densities. 
Therefore, different kernels in (B-1) give different modes in (B-2). 

Many optimization methods can be used for K-modes energy. 
For example, it is possible to use iterative (block-coordinate 
descent) approach typical of K-means methods: one step reclusters 
points and the other step locally refinement the modes, e.g. using 
mean-shift operation [29]. For better optimization, local refine¬ 
ment of the mode in each cluster can be replaces by the best 
mode search tracing all points within each cluster using mean- 
shift. RANSAC-like sampling procedure can be used for some 
compromise between speed and quality. It is also possible to use 
exhaustive search for the strongest mode in each cluster over 
observed discrete features and then locally refine each cluster’s 
mode with mean-shift. 

It is also interesting that discrete version of K-modes for 
histograms [115], [116] define modes rrik = (m^,..., ...) 

combining marginal modes for all attributes or dimensions j. 
Implicitly, they use distortion \\\\f^ for discrete kernel k{x^y) = 
= y^] where [•] are Iverson brackets. Marginal modes 
could be useful for aggregating sparse high-dimensional data. 

Analogously, we can also define a continuous kernel for 
marginal modes as 

k{x,y) = Yje . ( 3 - 3 ) 

3 


Note that this is different from the standard Gaussian kernel 


Wx-vV 


= n e 2.2 ^ 


which leads to regular modes energy (B-2). It is easy to check that 
kernel (B-3) corresponds to weak /cKM energy 


-EE E 
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= -Eiffel-E^iKI^') 
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where is a marginal Parzen density for dimension j. 


APPENDIX C (Entropy Clustering) 

First, we show that (14) reduces to (15) for descriptive models. 
Indeed, assume V{0) = P{'\0) is a continuous density of a 
sufficiently descriptive class (e.g. GMM). For any function f{x) 
Monte-Carlo estimation gives for any subset S cQ 

E f(k) HS| • f f{x)d4x)dx E |S| • if,d.) 

peS ^ 



Image and intensity histogram Solution A Solutions 



Fig. 35: Histograms in color spaces. Entropy criterion (15) with 
histograms can not tell a difference between A and B: bin permu¬ 
tations do not change the histogram’s entropy. 


where ds is a “true” density for intensities {Ip\p ^ S} and (,) is a 
dot product. If / = - log7^(6>s) and ds ^V(Os) then (14) implies 
(15) for differential entropy H(S) := H{v\0s)). For histograms 
Vh{^) = Vh(-\^) entropy-based interpretation (15) of (14) is 
exact for discrete entropy 

if(S):=-^P^(a;|S)-logP^(x|S)E-<P;.(S),logn(S)). 

X 

Intuitively, minimization of the entropy criterion (15) favors 
clusters with tight or “peaked” distributions. This criterion is 
widely used in categorical clustering [34] or decision trees [35], 
[36] where the entropy evaluates histograms over “naturally” 
discrete features. Below we discuss limitations of the entropy 
clustering criterion with either discrete histograms or continuous 
GMM densities in the context of color feature spaces. 

The case of histograms: In this case the key problem for 
color space clustering is illustrated in Fig.35. Once continuous 
color space is broken into bins, the notion of proximity between 
the colors in the nearby bins is lost. Since bin permutations do not 
change the histogram entropy, criterion (15) can not distinguish 
the quality of clusterings A and B in Fig.35; some permutation of 
bins can make B look very similar to A. 

The case of GMM densities: In this case the problem of 
entropy clustering (15) is different. In general, continuous density 
estimators commonly use Gaussian kernels, which preserve the 
notion of continuity in the color space. Indeed, the (differential) 
entropy for any reasonable continuous density estimate will see a 
significant difference between the clusters in A and B, see Fig.35. 

We observe that the main issue for entropy criterion (15) 
with GMM densities is related to optimization problems. In this 
case high-order energies (15) or (14) require joint optimization of 
discrete variables Sp and a large number of additional continuous 
parameters for optimum GMM density 7^(j^s)- That is, the use 
of complex parametric probability models leads to complex high- 
order mixed objective functions. Typical block coordinate descent 
methods [23], [26] iterating optimization of S and 0 are sensitive 
to local minima, see Figures 2 and 1(e). Better solutions like 
Figure 1(f) have lower energy, but they can not be easily found 
unless initialization is very good. 

These problems of pKM with histograms or GMM may ex¬ 
plain why descriptive model fitting is not common in the learning 
community for clustering high-dimensional continuous spaces. In¬ 
stead of pKM they often use a different extension of K-means, that 
is kernel K-means (/cKM) or related pairwise clustering criteria 
like Normalize Cut (NC), see Sec. 1.2.2 and 1.3.1. 
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APPENDIX D (Pseudo Bound Lemma) 

Lemma 2. Consider function e : {0,1}^^^ ^ defined by any 
symmetric matrix A and (strictly) positive vector w as 

X'AX 


e{X) = 


w'X 


Function Tt is a pseudo-bound ofe{X) at Xtfor W := diag(w) 

'WX 


Tt(X,6) := VeiXtYX + J 


((l-2XtyWX \ 
[ w'X, 


+1 (D-1) 


Ve^(Xt) = Ve(Xt)-(5 
= VeiXt)+S 


{w'Xty 

W{l-2Xt) 

w'Xt 


Note that conditional density ds in (E-1) can be written as 

[x e S] 


ds(x) = dn{x )■ 


w 


(E-3) 


where Ve(X) = w ^,^2 - Furthermore^ Tt{X,6) 

is an auxiliary function for e(X) for all S > -Ao(VE^ AW ) 
where Aq denotes the smallest eigenvalue of the matrix. 

Proof Diagonal shift 5W for matrix A defines function 

^ w'X ^ ^ w'X 

According to Lemma 1 function is concave for any (5 > -Aq 
since (6W + A) is p.s.d. for such 6. Thus, (53) defines a Taylor- 
based linear upper bound for for > -Ao 

VegiXtYX. 

We have e^{X) = e(X) - 5 for Boolean X where X'WX = 
w'X. Thus, the following upper bound is valid for optimizing 
e(X)overX£{0,l}l^lat5>-Ao 

Tt{X,5) = SJe^iXt)'X + 5. (D-2) 

where the definition of above yields gradient expression 

,2 (MfiXf) WXt - U^^W^)w 


where [•] is an indicator function. Eqs. (E-3) and (E-2) give 
L{S) = — J dQ(x)[x e S]dx-Y Y —— J dQ{x)[x e S]dx. 
Introducing notation 

/ := [x € S] and F := dQ(x) 
allows to further rewrite objective function L(S) as 
EIF EF(1-/) 


L(S) = 


El 1-EI 


Without loss of generality assume that (the 

opposite case would yield a similar result). We now need the 
following lemma. 

Lemma 3. Let a^b^c^d be some positive numbers, then 


a c 
b ~ d 


a a + c c 

- < - < 

b b + d d 


Proof Use reduction to a common denominator. □ 

Lemma 3 implies inequality 

T\ F^.FT 

(E-4) 


EF(1-/) EFI 

< EF < 


1-EI - -El 
which is needed to prove the Proposition below. 

Proposition 2. (Gini-bias) Assume that subset c 12 A 

Ss ■= {x : dQ^{x) > snpdQ(x) -e}. (E-5) 


Then 


where W1 = w for matrix W = diag{w) and X[WXt = w'Xt 
for any Xt since all iterations explore only Boolean solutions. □ 

APPENDIX E (PROOF OF G/n/S/as THEOREM 5) 

Let do^ be a continuous probability density function over domain 
12 ^ VX defining conditional density 

ds{x) dQ(x\x G S) (E-1) 

for any non-empty subset S c 12 and expectation 

E-/ z(x)dQ(x)dx 

for any function z : Q ^ IZ^. 

Suppose 12 is partitioned into two sets S and S such that S u 
S = 12 and S n S = 0. Note that S here and in the statement of 
Theorem 5 is not a discrete set of observations, which is what S 
means in the rest of the paper. Theorem 5 states a property of a 
fully continuous version of Gini criterion (78) that follows from 
an additional application of Monte-Carlo estimation allowing to 
replace discrete set cardinality |S| by probability u; of a continuous 
subset S 

w := J dQ(x)dx = J dQ(x) ■ [x e S]dx = E[x e S]. 

Then, minimization of Eg(^S) in (78) corresponds to maximiza¬ 
tion of the following objective function 

L{S):=w J ds^(x)dx + (1 - w) J ds^(x)dx. (E-2) 


supL(S) = limL(S£) = Ed^ + supdQ(x). (E-6) 
s 


Proof Due to monotonicity of expectation we have 
EFI E{Isup^dQ(x)) 


El " El - SUVdnix). (E-7) 

Then (E-4) and (E-7) imply 

EFI EF(l-I) 

L{S) = ^ ’ < snpdaix) + EF. (E-8) 

That is, the right part of (E-6) is an upper bound for I/(S). 

Let Is = [x G Se]. It is easy to check that 

£-0 1 - EF 

Definition (E-5) also implies 

1™ . 1™ E(sup^ dn{x) - e)Ie 


(E-9) 


lim 


^ > lim - 

0 Els 


Els 


= suy)dQ(x). 


This result and (E-7) conclude that 

=supdQ{x). 

Els X 

Einally, the limits in (E-9) and (E-10) imply 

EF(l-F) EFF 

\imL(Ss) = lim- ^ + lim ^ 

^^0 ^ £-0 1 -EIs Els 

= Edn + snpdQ(x). 

X 

This equality and bound (E-8) prove (E-6). 


(E-10) 


□ 
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